Closed matty234 closed 5 months ago
Thanks for the suggestion. For now it'll probably languish, but it is something I want at some point.
As for whether it's useful. There are some cases where zstd shines. Specifically unaligned data or non-reference based encoding with very long sequences. Deflate's biggest weakness is the 32kb window limit for back references. When some sequences can be longer than 32kb you'll see this is a rather tragic limitation! This would make zstd a suitable alternative to lzma. It's not quite as small, but has a much broader range of compression levels and is usually considerably faster, particularly for decode. That said in normal CRAM it's not so helpful. Note you can use samtools cram-size
to dump out information on the encoding types per CRAM data block and you'll see not much is using gzip/deflate really at CRAM 3.1.
Zstd however really should be used instead of deflate for the bgzf format, especially as it applies then to non-reference based sequences or very long VCF lines. I started work on this a year ago in https://github.com/jkbonfield/htslib/tree/bgzf2. It's mixture of seekable zstd with pzstd for parallel I/O. It was working, but I got pulled on to other projects and never found the time to dust it off again. If we finished this and added zstd as a dependency to htslib, it would probably also give me impetus to make zstd part of CRAM 4 too.
In hindsight, a deeper understanding of Zstd would have been beneficial before I started :D
Seekable Zstd looks interesting - I noticed that it doesn't have the same header format as standard Zstd files so it might be worth adding to/adjusting https://github.com/samtools/htslib/blob/deeb9f01376ca9416315e4c9f5fe489e6f03e05f/hts.c#L583.
Thanks again for your review!
This pull request adds Zstd support to the supported CRAM block compression methods.
I thought (wrongly) that this would have a significant impact on the compression ratio and throughput and generation speed. In reality, the impact is marginal on both and this isn't improved much by 'training' on the individual block types. There was some chat in a few issues that referenced this as a helpful addition though so if anyone from there has any suggestions for further improvements please do feel free to let me know 😄
A few notes: 1) No worries if this doesn't get merged I appreciate there's a spec and this will introduce backwards incompatible changes 1) If this were to be merged, would it make sense to only include the method in the
methmap
if it is explicitly enabled (but enable decompression regardless)? 1) I haven't adjusted themeth_cost
value for Zstd which remains at 1.0 - I think this will create a bias to always select Zstd (which probably isn't right) - are the current figures based on a benchmark or just arbitrary? 1) I extendedhts.c
with some more information yanked from the Zstd header - I'm not entirely sure why I did that as I now realise that has nothing to do with the CRAM side of things. There are details regarding the decompressed size and a field that determines whether a checksum can be found at the end of the file though (which might be helpful for other non-CRAM files)Re: #530