mbhall88 / compression_benchmark

Benchmarking FASTQ compression with 'mature' compression algorithms
MIT License

parallel implementations and clustering of reads #1

Open darked89 opened 1 year ago

darked89 commented 1 year ago

Hello,

There are mature programs, included in the major Linux distros, such as pigz, pbzip2, and lbzip2. These compress faster while being on par in compression ratio. The last one (lbzip2) also seems to be faster at decompression.
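
The invocations are something like this (thread counts are just illustrative; `-k` keeps the input file in each tool):

```sh
# pigz: parallel gzip; -p sets the number of threads
pigz -p 8 -k reads.fastq

# lbzip2: parallel bzip2; -n sets the number of threads
lbzip2 -n 8 -k reads.fastq

# pbzip2: parallel bzip2; -p# sets the number of processors
pbzip2 -p8 -k reads.fastq
```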

To achieve better compression ratios, it also helps to cluster reads by sequence with clumpify from BBMap beforehand. This tends to speed up downstream mapping a bit as well.
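
Roughly along these lines (a sketch using `clumpify.sh` from BBMap; file names are illustrative):

```sh
# Reorder reads so that similar/near-duplicate sequences sit next to
# each other, which lets downstream compressors find more redundancy
clumpify.sh in=reads.fastq.gz out=clumped.fastq.gz
```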

Hope it helps

DK

mbhall88 commented 1 year ago

Hi @darked89.

Thanks for the suggestions. I did contemplate adding the parallel compressors you mentioned, but figured I would just stick to the standard single-threaded implementations for simplicity. (zstd also has a multi-threading option.) Reading the pbzip2 docs, it seems it only does parallel decompression? I will have a think about adding a section on parallel (de)compression (really just a matter of whether I get the time).
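
For reference, zstd's multi-threading is a single flag (thread count and level here are illustrative):

```sh
# zstd: -T sets the number of worker threads (-T0 uses all cores);
# -19 is a high compression level; the input file is kept by default
zstd -T4 -19 reads.fastq
```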

Regarding clustering the reads, I'm sure you're correct, and that is very interesting, but again, this adds to the complexity of compression, and I wanted this benchmark to reflect the "standard" user/scenario. The other thing that would need to be accounted for in compression rates etc. is the time taken to cluster the reads.
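
That is, a fair comparison would time the clustering and compression together, something like (a sketch; tools and file names are illustrative):

```sh
# clumpify's runtime is part of the cost of the improved ratio,
# so time both steps as one pipeline
time ( clumpify.sh in=reads.fastq out=clumped.fastq && zstd -T4 clumped.fastq )
```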

Thanks again.

lpsantil commented 2 months ago

pbzip2 does compression as well. It's been very fast in my recent adoption of it.

Have you looked into lz4 (https://github.com/lz4/lz4) and LZHAM (https://github.com/richgel999/lzham_codec)? While lz4 only achieves compression ratios in line with the lower levels of gzip, or just below, it reaches decompression speeds in the GB/s range, usually within the same order of magnitude as memory-copy speed (https://github.com/lz4/lz4/tree/dev?tab=readme-ov-file#benchmarks).
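
For example (file names are illustrative; the input is kept by default):

```sh
# lz4: fast compression; -1 is the fastest (default) level
lz4 -1 reads.fastq reads.fastq.lz4

# decompression, which is where lz4 really shines speed-wise
lz4 -d reads.fastq.lz4 decompressed.fastq
```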