darked89 opened this issue 1 year ago
Hi @darked89.
Thanks for the suggestions. I did contemplate adding the parallel compressors you mentioned, but figured I would just stick to the standard single-threaded implementations for simplicity. (zstd also has a multi-threading option.) Reading the pbzip2 docs, it seems it only does decompression? I will have a think about adding a section on parallel (de)compression (really just a matter of whether I get the time).
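For concreteness, a minimal sketch of the zstd multi-threading option (the filename and thread count are hypothetical):

```sh
# Compress with 8 worker threads (-T0 would autodetect cores); -19 is a
# high compression level. Output goes to reads.fastq.zst; the input is kept.
zstd -T8 -19 reads.fastq -o reads.fastq.zst

# Decompression (single-threaded in zstd); -f overwrites the existing copy.
zstd -d -f reads.fastq.zst
```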
Regarding clustering the reads, I'm sure you're correct, and that is very interesting, but again, it adds to the complexity of the compression pipeline, and I wanted this benchmark to reflect the "standard" user/scenario. The other thing that would need to be accounted for, alongside compression rates etc., is the time taken to cluster the reads.
Thanks again.
pbzip2 does compression as well. It has been very fast in my recent experience with it.
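For the record, a minimal sketch of both directions with pbzip2 (hypothetical filename; -p sets the thread count):

```sh
# Compress with 8 threads, keeping the input (-k); writes reads.fastq.bz2.
pbzip2 -k -p8 reads.fastq

# Decompress with 8 threads. Note that only files compressed by pbzip2
# itself (multi-stream bzip2) decompress in parallel.
pbzip2 -d -p8 reads.fastq.bz2
```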
Have you looked into lz4 (https://github.com/lz4/lz4) and LZHAM (https://github.com/richgel999/lzham_codec)? While lz4 will only achieve compression ratios in line with the lower levels of gzip, or just below, it will also achieve decompression speeds in the GB/s range, usually within the same order of magnitude as memory-copy speed (https://github.com/lz4/lz4/tree/dev?tab=readme-ov-file#benchmarks).
Hello,
there are mature programs included in the major Linux distros, such as pigz, pbzip2, and lbzip2. These compress faster while staying on par in compression ratio. The last one (lbzip2) also seems to be faster at decompression.
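As a sketch of the invocations (hypothetical filename; -p/-n set the thread count):

```sh
# pigz: parallel gzip with 8 threads; -k keeps the input, and the
# output stays compatible with plain gzip/gunzip.
pigz -k -p 8 reads.fastq

# lbzip2: parallel bzip2 compression and decompression.
lbzip2 -n 8 reads.fastq
lbzip2 -d -n 8 reads.fastq.bz2
```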
For achieving better compression ratios, it helps to cluster reads based on their sequence with clumpify from BBMap. This also tends to speed up downstream mapping a bit.
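A sketch of that clustering step (in=/out= are BBMap's standard parameters; filenames hypothetical):

```sh
# Group reads with near-identical sequence next to each other; the output
# extension controls compression, so write plain FASTQ here.
clumpify.sh in=reads.fastq.gz out=clumped.fastq

# Compressing the clustered file then typically yields a better ratio,
# e.g. with multi-threaded zstd.
zstd -T8 -19 clumped.fastq
```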
Hope it helps
DK