Closed by FrancescAlted 9 years ago
Thanks for your nice comments. LZ77 needs large blocks to be efficient; Blosc + lz4 with a block size of 64Ki integers (256K bytes) is included in the benchmark only as an indication. When large blocks are involved in query processing (inverted indexes, search engines, databases, graphs, in-memory computing, ...), they must be decoded entirely. Small fixed-size blocks (e.g. 128 integers) are usually used instead, which makes it possible to skip decompressing blocks that are not needed to answer a query. For example, in the "Inverted Index App" included in TurboPFor, only about 10% of the blocks are decompressed during query processing.
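The skip-unneeded-blocks idea can be sketched in a few lines. This is a minimal illustration assuming sorted doc-ids and a simple per-block delta coding as a stand-in for a real block codec such as TurboPFor; all helper names here are hypothetical:

```python
# Fixed-size blocks of 128 integers; only blocks that can overlap the
# query range are decoded. The delta coding is a stand-in for a real
# integer codec, used just to make "decoding" cost something.
BLOCK = 128

def encode_blocks(ids):
    """Split sorted ids into blocks of (first_value, deltas)."""
    blocks = []
    for i in range(0, len(ids), BLOCK):
        chunk = ids[i:i + BLOCK]
        blocks.append((chunk[0], [b - a for a, b in zip(chunk, chunk[1:])]))
    return blocks

def decode_block(block):
    first, deltas = block
    out = [first]
    for d in deltas:
        out.append(out[-1] + d)
    return out

def query_range(blocks, lo, hi):
    """Decode only the blocks that can overlap [lo, hi]."""
    hits, decoded = [], 0
    for i, blk in enumerate(blocks):
        nxt = blocks[i + 1][0] if i + 1 < len(blocks) else None
        if nxt is not None and nxt <= lo:
            continue        # block lies entirely below the range: skip
        if blk[0] > hi:
            break           # this and all later blocks lie above: stop
        decoded += 1
        hits.extend(x for x in decode_block(blk) if lo <= x <= hi)
    return hits, decoded

ids = list(range(0, 10000, 7))         # 1429 sorted doc-ids
blocks = encode_blocks(ids)            # 12 blocks of at most 128 ids
hits, touched = query_range(blocks, 700, 900)
# only 2 of the 12 blocks are decoded to answer this narrow range query
```

With narrow queries, the fraction of blocks actually decoded stays small, which is the point being made above about why block-oriented compressors need random access at a fine granularity.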
I see. However, Blosc does split the compressed buffer into blocks (between 8 KB and 1 MB, depending on the typesize and compression level; see how this is computed here: https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.c#L852), and each block can be accessed randomly (hence decompressed selectively) with the blosc_getitem() function: https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.h#L226 . So I would say that this fulfills your requirements.
Well, with 8Ki-integer blocks (32K bytes) Blosc with zlib has a worse compression ratio than the TurboPFor functions, and, more importantly, TurboPFor compresses ~12 times faster and decompresses ~13 times faster than Blosc. All other configurations are not competitive in terms of compression ratio or speed. To compare apples to apples, multithreading cannot be considered: it is trivial to add and is normally implemented outside the compression functions. TurboPFor also has a faster transpose function (the same idea as shuffle in Blosc), plus delta and zigzag filters that can be used in conjunction with zlib, lz4 or other compressors. If you are interested, I suggest reading the papers linked at the bottom of the TurboPFor README file.
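The delta and zigzag filters mentioned here can be sketched briefly. This is a pure-Python illustration of the general technique (zigzag as used, e.g., in Protocol Buffers varints), not TurboPFor's actual SIMD implementation:

```python
# Deltas of a (mostly sorted) sequence are small, and zigzag maps signed
# deltas to small unsigned integers that a bit-packer or a general
# compressor (zlib, lz4, ...) can then exploit.
def zigzag(n):
    # 0, -1, 1, -2, 2, ...  ->  0, 1, 2, 3, 4, ...  (assumes |n| < 2**63)
    return (n << 1) ^ (n >> 63)

def unzigzag(z):
    return (z >> 1) ^ -(z & 1)

def delta_zigzag(values):
    prev, out = 0, []
    for v in values:
        out.append(zigzag(v - prev))
        prev = v
    return out

def undo_delta_zigzag(codes):
    prev, out = 0, []
    for z in codes:
        prev += unzigzag(z)
        out.append(prev)
    return out

ids = [5, 12, 12, 40, 41, 100]
codes = delta_zigzag(ids)          # [10, 14, 0, 56, 2, 118] -- all small
assert undo_delta_zigzag(codes) == ids
```

The filtered stream consists of small integers regardless of the magnitude of the original values, which is what makes the downstream compressor (or bit-packer) effective.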
@powturbo To be fair, adding comparisons where your code will obviously beat the alternative cannot be harmful. (Though I understand that it can be a waste of your time.)
Hi Daniel, no, I simply don't want the benchmark tables to be bloated without adding any information. I've considered "lz4" the best LZ77 codec in Blosc, and included my "delta+transpose" with "lz4" as an indication. Maybe it is possible to make a separate table with 8Ki, 32Ki or 64Ki block sizes for this sort of test. All functions will be tested under the same conditions.
@powturbo Well, my point is that "Blosc + lz4 with block size 64K" is not representative of what Blosc is doing behind the scenes (as said, Blosc actually uses different block sizes internally). Also, I think the speed of BloscLZ (especially when decompressing) may be worth a mention, but this is certainly up to you.
I felt that "blosc lz4" was the best option and that "blosclz" was the slower, LZF-like codec. I've now included blosclz with a 64Ki-integer block size (=256K bytes) and a link to your benchmark table. The benchmark is repeated several times and the best timings are taken.
Hmm, I am not sure why you are getting such different results for blosclz on the DocId dataset (compared with the synthetic dataset). Anyway, I am finishing the inclusion of a bitshuffle filter (like shuffle, but the transpose happens at the bit level) in https://github.com/Blosc/c-blosc/pull/146, so it would be a good time to try again with bitshuffle instead.
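For readers unfamiliar with the shuffle filter being discussed: for elements of size T it gathers byte 0 of every element, then byte 1, and so on, so that the often-zero high-order bytes form long homogeneous runs; bitshuffle applies the same idea at the bit level. A pure-Python sketch (not Blosc's SIMD implementation):

```python
# Byte-level shuffle: transpose an array of T-byte elements so that all
# first bytes come first, then all second bytes, etc.
def shuffle(buf, typesize):
    return b"".join(buf[j::typesize] for j in range(typesize))

def unshuffle(buf, typesize):
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for j in range(typesize):
        for i, byte in enumerate(buf[j * n:(j + 1) * n]):
            out[i * typesize + j] = byte
    return bytes(out)

buf = b"".join(v.to_bytes(4, "little") for v in [1, 2, 3, 4])
shuffled = shuffle(buf, 4)
# b'\x01\x02\x03\x04' followed by 12 zero bytes: easy for any LZ coder
assert unshuffle(shuffled, 4) == buf
```

For small integers stored in 4-byte slots, three of every four bytes are zero; after shuffling they become one long zero run, which any LZ-family compressor handles very well.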
I have just seen your project, and I see that you are using Blosc for comparison. For completeness, I would recommend including other Blosc codecs, as they give different results that may be a good fit in different scenarios.
Below, I have selected several combinations that work well for this problem, and here are my results (Linux, GCC 4.9.2, Xeon E3-1240 v3 @ 3.40GHz):
The timings with clang 3.5 are very close to these, so I am not reproducing them.
By the way, very nice work! I did not realize that compressing integers was so important!