powturbo / TurboPFor-Integer-Compression

Fastest Integer Compression
GNU General Public License v2.0

Comparison with Blosc can be improved #2

Closed FrancescAlted closed 9 years ago

FrancescAlted commented 9 years ago

I have just seen your project and I see that you are using Blosc for comparison. For completeness, I would recommend including other Blosc codecs as well, since they give different results that may be a good fit in different scenarios.

Below, I have selected several combinations that work well for this problem; here are my results (Linux, GCC 4.9.2, Xeon E3-1240 v3 @ 3.40GHz):

$ time LD_LIBRARY_PATH=/home/francesc/c-blosc/build/blosc ./icbench -a1.5 -m0 -M255 -n100m
zipf alpha=1.50 range[0..255].n=100000000

bits histogram:0:40.20% 1:14.22% 2:12.75% 3:10.28% 4:7.77% 5:5.69% 6:4.09% 7:2.92% 8:2.07%
      size  ratio%  bits/int  comp MB/s  decomp MB/s
  62064288   15.52      4.97      27.85       130.24  blosc (zlib, nthreads=1, clevel=3)
  62064288   15.52      4.97      42.85       214.16  blosc (zlib, nthreads=4, clevel=3)
  63392801   15.85      5.07     396.74      1412.46  TurboPFor
  77187330   19.30      6.17      51.44       663.17  blosc (lz4hc, nthreads=1, clevel=3)
  77187330   19.30      6.17      78.52       984.59  blosc (lz4hc, nthreads=4, clevel=3)
 101473443   25.37      8.12     862.60       896.15  blosc (lz4, nthreads=1, clevel=3)
 101473443   25.37      8.12    1193.86      1392.87  blosc (lz4, nthreads=2, clevel=3)
 101473443   25.37      8.12    1608.41      1549.34  blosc (lz4, nthreads=4, clevel=3)
 102491006   25.62      8.20     595.36      1815.32  blosc (blosclz, nthreads=1, clevel=3)
 102491006   25.62      8.20     949.23      2049.91  blosc (blosclz, nthreads=2, clevel=3)
 102491006   25.62      8.20    1293.24      1873.03  blosc (blosclz, nthreads=4, clevel=3)

The timings with clang 3.5 are very close to these, so I am not reproducing them.

By the way, very nice work! I had not realized that compressing integers was that important!

powturbo commented 9 years ago

Thanks for your nice comments. LZ77 needs large blocks to be efficient; Blosc + LZ4 with a 64Ki block size (256K bytes) is included in the benchmark only as an indication. When large blocks are involved, each block must be decoded entirely while processing queries (inverted indexes, search engines, databases, graphs, in-memory computing, ...). Small fixed-size blocks (e.g. 128 integers) are usually used instead, which makes it possible to skip decompressing blocks that are not needed to answer a query. For example, in the "Inverted Index App" included in TurboPFor, only 10% of the blocks are decompressed during query processing.
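The layout described above can be sketched as follows. This is an illustration only: real TurboPFor blocks are bit-packed, whereas this stand-in "codec" merely delta-codes each fixed 128-integer block, so that a lookup decodes a single block and leaves the rest of the list untouched:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

#define BLK 128  /* fixed block size in integers, as in typical inverted indexes */

/* Toy stand-in codec: delta-encode one block (real codecs would bit-pack). */
static void blk_enc(const uint32_t *in, uint32_t *out) {
    uint32_t prev = 0;
    for (int i = 0; i < BLK; i++) { out[i] = in[i] - prev; prev = in[i]; }
}
static void blk_dec(const uint32_t *in, uint32_t *out) {
    uint32_t prev = 0;
    for (int i = 0; i < BLK; i++) { prev += in[i]; out[i] = prev; }
}

/* Decode only the block containing position `pos`; all other blocks
   of the encoded list are never touched. */
uint32_t get_docid(const uint32_t *enc, size_t pos) {
    uint32_t blk[BLK];
    blk_dec(enc + (pos / BLK) * BLK, blk);
    return blk[pos % BLK];
}
```

With, say, a four-block list, answering a query for one docid costs one `blk_dec` instead of decompressing the whole list.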

FrancescAlted commented 9 years ago

I see. However, Blosc does split the compressed buffer into blocks (between 8 KB and 1 MB, depending on the typesize and compression level; see how this is computed here: https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.c#L852), and each block can be accessed randomly (hence decompressed selectively) with the blosc_getitem() function: https://github.com/Blosc/c-blosc/blob/master/blosc/blosc.h#L226 . So I would say that this fulfills your requirements.

powturbo commented 9 years ago

Well, Blosc with zlib at 8Ki blocks (32K bytes) has a worse compression ratio than the TurboPFor function, and more importantly, TurboPFor compresses ~12 times faster and decompresses ~13 times faster than Blosc. None of the other configurations is competitive in terms of compression ratio or speed. To compare apples to apples, multithreading cannot be considered: it is trivial to add and is normally implemented outside the compression functions. TurboPFor also has a faster transpose function (the same as shuffle in Blosc), plus delta and zigzag transforms that can be used in conjunction with zlib, LZ4, or other compressors. If you are interested, I suggest reading the papers linked at the bottom of the TurboPFor README file.
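For readers unfamiliar with these transforms, here is a generic scalar sketch of delta + zigzag (an illustration only, not TurboPFor's vectorized implementation): delta coding stores successive differences, and zigzag folds signed deltas into small unsigned values that bit-packers and general-purpose compressors handle well:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Zigzag maps 0,-1,1,-2,2,... to 0,1,2,3,4,... so small signed deltas
   become small unsigned values. */
static inline uint32_t zigzag_enc(int32_t v) {
    return ((uint32_t)v << 1) ^ (uint32_t)(v >> 31);
}
static inline int32_t zigzag_dec(uint32_t u) {
    return (int32_t)(u >> 1) ^ -(int32_t)(u & 1);
}

/* Delta + zigzag: out[i] stays small wherever in[] changes slowly. */
void delta_zigzag_enc(const uint32_t *in, uint32_t *out, size_t n) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        out[i] = zigzag_enc((int32_t)(in[i] - prev));
        prev = in[i];
    }
}
void delta_zigzag_dec(const uint32_t *in, uint32_t *out, size_t n) {
    uint32_t prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev += (uint32_t)zigzag_dec(in[i]);
        out[i] = prev;
    }
}
```

For example, the sequence 5, 7, 6 encodes to 10, 4, 1 (deltas 5, 2, -1 zigzagged), which a downstream compressor can pack into far fewer bits.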

lemire commented 9 years ago

@powturbo To be fair, adding comparisons where your code will obviously beat the alternative cannot be harmful. (Though I understand that it can be a waste of your time.)

powturbo commented 9 years ago

Hi Daniel. No, it's simply that I don't want the benchmark tables to become bloated without adding any information. I considered "lz4" the best LZ77 codec in Blosc and included my "delta+transpose" with "lz4" as an indication. Maybe a separate table with 8Ki, 32Ki, or 64Ki block sizes is possible for this sort of test, where all functions would be tested under the same conditions.

FrancescAlted commented 9 years ago

@powturbo Well, my point is that "Blosc + lz4 with block size 64K" is not representative of what Blosc is doing behind the scenes (as said, Blosc actually uses different block sizes internally). Also, I think the speed of BloscLZ (especially when decompressing) is maybe worth a mention, but this is certainly up to you.

powturbo commented 9 years ago

I felt that "blosc lz4" was the best option and that "blosclz" was the slower LZF. I've now included blosclz with a 64Ki block size (= 256K bytes) and a link to your benchmark table. The benchmark is repeated several times and the best timings are kept.
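A best-of-N timing loop of the kind described can be sketched like this (hypothetical helper names; the actual icbench harness may differ). Taking the minimum over repetitions filters out interference from other processes:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* Run fn(arg) `reps` times and return the best (minimum) elapsed time. */
double best_of(void (*fn)(void *), void *arg, int reps) {
    double best = 1e300;
    for (int r = 0; r < reps; r++) {
        double t0 = now_sec();
        fn(arg);
        double t = now_sec() - t0;
        if (t < best) best = t;
    }
    return best;
}

/* Tiny demo workload: bump a counter 1000 times per call. */
void spin(void *arg) {
    volatile int *n = arg;
    for (int i = 0; i < 1000; i++) (*n)++;
}
```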

FrancescAlted commented 9 years ago

Hmm, I am not sure why you are getting such different results for blosclz on the DocId dataset (compared with the synthetic dataset). Anyway, I am finishing the inclusion of a bitshuffle filter (like shuffle, but the transpose happens at the bit level) in: https://github.com/Blosc/c-blosc/pull/146, so it would be a good time to try again with bitshuffle instead.
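For the 8-bit case, bitshuffle can be sketched naively as follows (a scalar illustration only; the linked PR uses a SIMD implementation). Bit plane b of every input byte is gathered into a contiguous region, so planes of mostly-zero high bits compress very well:

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Naive bitshuffle for 8-bit values; n must be a multiple of 8.
   Output byte (b*n/8 + i/8), bit (i%8), holds bit b of input i. */
void bitshuffle8(const uint8_t *in, uint8_t *out, size_t n) {
    size_t nb = n / 8;
    memset(out, 0, n);
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 8; b++)
            out[(size_t)b * nb + i / 8] |= (uint8_t)(((in[i] >> b) & 1u) << (i % 8));
}

/* Inverse transform: scatter each bit back to its original position. */
void bitunshuffle8(const uint8_t *in, uint8_t *out, size_t n) {
    size_t nb = n / 8;
    memset(out, 0, n);
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 8; b++)
            out[i] |= (uint8_t)(((in[(size_t)b * nb + i / 8] >> (i % 8)) & 1u) << b);
}
```

If every input value is 0 or 1, only the first n/8 output bytes are non-zero; the remaining seven bit planes are runs of zeros, which is exactly what makes bitshuffle effective on low-entropy integers.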