Closed pps83 closed 1 year ago
Note, my coders::allCodecs() and .encode are thin wrappers over FastPForLib::IntegerCODEC.
https://github.com/lemire/FastPFor/pull/100 fixes simdfastpfor128 and simdfastpfor256. However, varintg8iu and simdgroupsimple_ringbuf still have the diff issue, and the output differences are deep inside the output buffer itself.
I did more tests; it seems varintg8iu leaves uninitialized bytes at the end of the output buffer. In my test case, it's the last 7 bytes.
In the case of simdgroupsimple_ringbuf, I get a whopping 1040 consecutive uninitialized bytes somewhere in the middle of the output buffer. Seems like a bug in simdgroupsimple_ringbuf.
Have you read the detailed documentation of the codec in question? I recommend you do before using it...
/**
* This is an implementation of the compression algorithm SIMD-GroupSimple,
* which was proposed in Section 4 of the following paper:
*
* W. X. Zhao, X. Zhang, D. Lemire, D. Shan, J. Nie, H. Yan, and J. Wen.
* A general simd-based approach to accelerating compression algorithms.
* ACM Trans. Inf. Syst., 33(3), 2015.
* http://arxiv.org/abs/1502.01916
*
* Implemented by Patrick Damme,
* https://wwwdb.inf.tu-dresden.de/our-group/team/patrick-damme .
*
* We provide two variants of the compression part of the algorithm.
*
* The original variant
* ====================
* The first variant closely follows the original algorithm as described in the
* paper. We also implemented the two optimizations mentioned in the paper:
* - The calculation of pseudo quad max values instead of quad max values.
* - One specialized (un)packing routine for each selector, whereby the
* appropriate routine is selected by a switch-case-statement.
* However, our implementation differs from the paper in some minor points, for
* instance, we directly look up the mask used in the pattern selection
* algorithm instead of calculating it from a looked up bit width.
*
* The variant using a quad max ring buffer
* ========================================
* The second variant is based on the original description, but uses a ring
* buffer instead of an array for the (pseudo) quad max values to reduce the
* size of the temporary data during the compression. More details on this can
* be found in Section 3.2.3 of the following paper:
*
* P. Damme, D. Habich, J. Hildebrandt, and W. Lehner. Lightweight data
* compression algorithms: An experimental survey (experiments and analyses).
* In Proceedings of the 20th International Conference on Extending Database
* Technology, EDBT 2017.
* http://openproceedings.org/2017/conf/edbt/paper-146.pdf
*
* The template parameter useRingBuf determines which variant is used:
* - false: original variant
* - true: the variant with the ring buffer
* Both variants use the same packing routines and the same decompression
* algorithm. Our experiments suggest that the variant with the ring buffer is
* faster than the original algorithm for small bit widths.
*
* Compressed data format
* ======================
* As described in the original paper, the compressed data consists of two
* areas, whose separation is a crucial point of the algorithm: the selectors
* area and the data area, which we store in this order. The original variant
* generates all selectors before it compresses the blocks. Thus, it knows the
* size of the selectors area before it starts writing to the data area. So it
* can start the data area directly after the selectors area. However, the
* variant using the ring buffer compresses a block immediately when it
* determines the selector. Thus, it does not know the size of the selectors
* area before it starts writing to the data area. To prevent the two areas
* from overlapping, we need to leave a "pessimistic gap" between them, i.e.,
* we reserve the worst-case number of bytes for the selectors area. This has
* no impact on the amount of data that actually needs to be written during the
* compression or read during the decompression. However, the compression rates
* obtained with this approach might be considered misleading, since it could
* be argued that the unused gap should not be counted (but it needs to be
* added to nvalue). A second boolean template parameter, pessimisticGap, lets
* you decide how to handle this issue:
* - false: There will be no gap between the selectors area and the data area
* (except for a small SIMD-padding). The reported compression rates
* will be correct. For the original variant, this does not cause any
* overhead. For the variant with the ring buffer, this requires to
* copy the whole data area directly behind the selectors area, which
* means a runtime overhead.
* - true: There will be a pessimistic gap between the two areas. The reported
* compression rates will be misleading, unless we really have the
* worst case (each input quad contains at least one value of more than
* 16 bits). This causes no run time overhead, neither for the original
* nor for the variant using the ring buffer.
* To sum up: For maximum performance use SIMDGroupSimple<false, false> or
* SIMDGroupSimple<true, true>; to verify that the two variants really produce
* the same data, use the same value for pessimisticGap.
*/
You are using the version with the pessimistic gap set to true. As the name suggests, and as other comments make clear, this can leave large gaps.
It does not look like a bug.
Ah, so simdgroupsimple_ringbuf is the same as simdgroupsimple except that it has that pessimistic gap in the middle?
@pps83 No. But just wait I am preparing a PR.
Please see https://github.com/lemire/FastPFor/pull/101
The change just turns off the pessimistic gap feature. It will slow down compression a bit. That's what I propose doing.
I pulled your PR locally to verify. I still get one case with a one-byte gap in simdgroupsimple when I use 100 instead of 1000 to fill the input buffer.
Thanks for the review. Please consider producing a pull request to trim the byte in question.
For my own purposes I added unit tests, and I got lots of failures because some of the codecs in FastPFor leave uninitialized gaps in output buffers when encoding. Namely, these are the codecs that fail in my tests: simdfastpfor128, simdfastpfor256, varintg8iu, simdgroupsimple_ringbuf.
This is the test I use:
Most codecs pass the test, but those that fail have this memory diff: ![image](https://user-images.githubusercontent.com/1614246/219432926-b94c5d34-e797-4ca7-a98b-00a6d7e036c2.png)
As you can see, the failing codecs set the first 8 bytes, then leave the next 8 bytes untouched. IMO, these codecs should be fixed to set that "untouched" memory to something.