lemire / FastPFor

The FastPFOR C++ library: Fast integer compression
Apache License 2.0

Some of the functions leave gaps in output buffers #99

Closed: pps83 closed this issue 1 year ago

pps83 commented 1 year ago

For my own purposes I added unit tests and got lots of failures, because some of the codecs in FastPFor leave uninitialized gaps in their output buffers when encoding. These are the codecs that fail in my tests: simdfastpfor128, simdfastpfor256, varintg8iu, simdgroupsimple_ringbuf

This is the test I use:

TEST(Compression, testIntegerCodec)
{
    const size_t SZ = 12345;
    const uint32_t n = 1000;
    const int diff = 10;
    std::vector<uint32_t> in(SZ, n);
    for (auto& v : in)
        v += (rand32() % (diff * 2)) - diff;  // rand32() is a test helper, see note below
    // in is an array of random numbers in the range [990, 1009]

    for (auto& coder : coders::allCodecs())
    {
        if (coder.name() == "Null")
            continue;  // skip the Null codec
        LOG("coder: %s", std::string(coder.name()).c_str());
        std::vector<uint8_t> buf8(4 * (SZ + SZ / 4), 0xaa);  // <-- init to 0xaa
        std::vector<uint32_t> buf32a, buf32b;
        size_t a = coder.encode(in, buf32a);
        size_t b = coder.encode(&in[0], in.size(), buf32b);
        size_t c = coder.encode(&in[0], in.size(), &buf8[0], buf8.size());
        buf8.resize(c);
        ASSERT_EQ(a, b);
        ASSERT_EQ(a, c);
        ASSERT_EQ(buf32a, buf32b);
        ASSERT_EQ(0, memcmp(buf32a.data(), buf8.data(), c));
    }
}
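
Note: rand32() above is not part of FastPFor; it's just a small helper in my test code. Any 32-bit PRNG would do, for example something like this (hypothetical stand-in, not the actual helper):

#include <cstdint>
#include <random>

// Hypothetical stand-in for the rand32() test helper; any 32-bit PRNG works.
static uint32_t rand32()
{
    static std::mt19937 gen(12345);  // fixed seed so failures are reproducible
    return static_cast<uint32_t>(gen());
}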

Most codecs pass the test, but those that fail show this memory diff (see attached screenshot):

As you can see, the codecs that fail set the first 8 bytes, then leave the next 8 bytes untouched. IMO, these codecs should be fixed to set that "untouched" memory to something.

pps83 commented 1 year ago

Note: my coders::allCodecs() and .encode() are thin wrappers over FastPForLib::IntegerCODEC.
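
A minimal sketch of such a wrapper, assuming FastPFor's IntegerCODEC::encodeArray interface (codec lookup would go through CODECFactory::getFromName; the factory interface varies slightly between FastPFor versions, so treat this as illustrative only):

#include "codecfactory.h"  // FastPFor header; include path may differ per setup
#include <cstdint>
#include <vector>

// Illustrative wrapper, not the actual test harness: compress `in` into a
// vector of 32-bit words via IntegerCODEC::encodeArray and return the
// compressed size in bytes (which is what the test compares).
static size_t encodeWith(FastPForLib::IntegerCODEC &codec,
                         const std::vector<uint32_t> &in,
                         std::vector<uint32_t> &out)
{
    out.resize(in.size() + in.size() / 4 + 1024);  // generous worst-case room
    size_t nvalue = out.size();
    codec.encodeArray(in.data(), in.size(), out.data(), nvalue);
    out.resize(nvalue);  // encodeArray sets nvalue to the number of 32-bit words written
    return nvalue * sizeof(uint32_t);
}

The raw-pointer and byte-buffer overloads used in the test funnel into encodeArray in essentially the same way.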

pps83 commented 1 year ago

https://github.com/lemire/FastPFor/pull/100 fixes simdfastpfor128 and simdfastpfor256. However, varintg8iu and simdgroupsimple_ringbuf still have the diff issue, and the differences are deep inside the output buffer itself.

pps83 commented 1 year ago

I did more tests; it seems varintg8iu leaves uninitialized bytes at the end of the output buffer. In my test case, it's the last 7 bytes. In the case of simdgroupsimple_ringbuf I get a whopping 1040 consecutive uninitialized bytes somewhere in the middle of the output buffer. Seems like a bug in simdgroupsimple_ringbuf.
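
A simple way to avoid such trailing gaps on the encoder side would be to zero the unused tail of the last group before returning, along these lines (just a sketch, not the actual FastPFor code):

#include <cstdint>
#include <cstring>

// Sketch only: zero the unused tail of the last 8-byte group so the encoder
// never hands back uninitialized bytes. `groupStart` points at the start of
// the last group, `used` is how many of its bytes were actually written.
static void zeroGroupTail(uint8_t *groupStart, size_t used, size_t groupSize = 8)
{
    if (used < groupSize)
        std::memset(groupStart + used, 0, groupSize - used);
}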

lemire commented 1 year ago

In the case of simdgroupsimple_ringbuf I get a whopping 1040 consecutive uninitialized bytes somewhere in the middle of the output buffer. Seems like a bug in simdgroupsimple_ringbuf.

Have you read the detailed documentation of the codec in question? I recommend you do so before using it...

/**
 * This is an implementation of the compression algorithm SIMD-GroupSimple,
 * which was proposed in Section 4 of the following paper:
 *
 * W. X. Zhao, X. Zhang, D. Lemire, D. Shan, J. Nie, H. Yan, and J. Wen.
 * A general simd-based approach to accelerating compression algorithms.
 * ACM Trans. Inf. Syst., 33(3), 2015.
 * http://arxiv.org/abs/1502.01916
 *
 * Implemented by Patrick Damme,
 * https://wwwdb.inf.tu-dresden.de/our-group/team/patrick-damme .
 *
 * We provide two variants of the compression part of the algorithm.
 *
 * The original variant
 * ====================
 * The first variant closely follows the original algorithm as described in the
 * paper. We also implemented the two optimizations mentioned in the paper:
 * - The calculation of pseudo quad max values instead of quad max values.
 * - One specialized (un)packing routine for each selector, whereby the
 *   appropriate routine is selected by a switch-case-statement.
 * However, our implementation differs from the paper in some minor points, for
 * instance, we directly look up the mask used in the pattern selection
 * algorithm instead of calculating it from a looked up bit width.
 *
 * The variant using a quad max ring buffer
 * ========================================
 * The second variant is based on the original description, but uses a ring
 * buffer instead of an array for the (pseudo) quad max values to reduce the
 * size of the temporary data during the compression. More details on this can
 * be found in Section 3.2.3 of the following paper:
 *
 * P. Damme, D. Habich, J. Hildebrandt, and W. Lehner. Lightweight data
 * compression algorithms: An experimental survey (experiments and analyses).
 * In Proceedings of the 20th International Conference on Extending Database
 * Technology, EDBT 2017.
 * http://openproceedings.org/2017/conf/edbt/paper-146.pdf
 *
 * The template parameter useRingBuf determines which variant is used:
 * - false: original variant
 * - true: the variant with the ring buffer
 * Both variants use the same packing routines and the same decompression
 * algorithm. Our experiments suggest that the variant with the ring buffer is
 * faster than the original algorithm for small bit widths.
 *
 * Compressed data format
 * ======================
 * As described in the original paper, the compressed data consists of two
 * areas, whose separation is a crucial point of the algorithm: the selectors
 * area and the data area, which we store in this order. The original variant
 * generates all selectors before it compresses the blocks. Thus, it knows the
 * size of the selectors area before it starts writing to the data area. So it
 * can start the data area directly after the selectors area. However, the
 * variant using the ring buffer compresses a block immediately when it
 * determines the selector. Thus, it does not know the size of the selectors
 * area before it starts writing to the data area. To prevent the two areas
 * from overlapping, we need to leave a "pessimistic gap" between them, i.e.,
 * we reserve the worst-case number of bytes for the selectors area. This has
 * no impact on the amount of data that actually needs to be written during the
 * compression or read during the decompression. However, the compression rates
 * obtained with this approach might be considered misleading, since it could
 * be argued that the unused gap should not be counted (but it needs to be
 * added to nvalue). A second boolean template parameter, pessimisticGap, lets
 * you decide how to handle this issue:
 * - false: There will be no gap between the selectors area and the data area
 *          (except for a small SIMD-padding). The reported compression rates
 *          will be correct. For the original variant, this does not cause any
 *          overhead. For the variant with the ring buffer, this requires to
 *          copy the whole data area directly behind the selectors area, which
 *          means a runtime overhead.
 * - true: There will be a pessimistic gap between the two areas. The reported
 *         compression rates will be misleading, unless we really have the
 *         worst case (each input quad contains at least one value of more than
 *         16 bits). This causes no run time overhead, neither for the original
 *         nor for the variant using the ring buffer.
 * To sum up: For maximum performance use SIMDGroupSimple<false, false> or
 * SIMDGroupSimple<true, true>; to verify that the two variants really produce
 * the same data, use the same value for pessimisticGap.

You are using the version with pessimistic gap set to true. As the name suggests, and as other comments make clear, this can leave large gaps.

It does not look like a bug.
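
For reference, here is roughly how the ring-buffer variant without the pessimistic gap would be instantiated (a sketch only; the header name simdgroupsimple.h and the include path are assumptions, and error handling is omitted):

#include "simdgroupsimple.h"  // assumed header name; adjust to your include path
#include <cstdint>
#include <vector>

// Sketch: useRingBuf = true, pessimisticGap = false, i.e. the ring-buffer
// variant with no gap between the selectors area and the data area.
void compressNoGap(const std::vector<uint32_t> &in, std::vector<uint32_t> &out)
{
    FastPForLib::SIMDGroupSimple<true, false> codec;
    out.resize(in.size() + 1024);  // generous head room, illustrative only
    size_t nvalue = out.size();
    codec.encodeArray(in.data(), in.size(), out.data(), nvalue);
    out.resize(nvalue);  // nvalue = number of 32-bit words written
}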

pps83 commented 1 year ago

Ah, so simdgroupsimple_ringbuf is the same as simdgroupsimple, except that it has that pessimistic gap in the middle?

lemire commented 1 year ago

@pps83 No. But just wait, I am preparing a PR.

lemire commented 1 year ago

Please see https://github.com/lemire/FastPFor/pull/101

The change just turns off the pessimistic gap feature. It will slow down compression a bit. That's what I propose doing.

pps83 commented 1 year ago

I pulled your PR locally to verify. I still get one case with a one-byte gap in simdgroupsimple when I use 100 instead of 1000 to fill the input buffer.

lemire commented 1 year ago

I still get one case with a one-byte gap in simdgroupsimple when I use 100 instead of 1000 to fill the input buffer.

Thanks for the review. Please consider producing a pull request to trim the byte in question.