samtools / htslib

C library for high-throughput sequencing data formats
Other
803 stars 446 forks source link

Add support for CRAM Zstd compression #1770

Closed matty234 closed 5 months ago

matty234 commented 5 months ago

This pull request adds Zstd support to the supported CRAM block compression methods.

I thought (wrongly) that this would have a significant impact on the compression ratio and throughput and generation speed. In reality, the impact is marginal on both and this isn't improved much by 'training' on the individual block types. There was some chat in a few issues that referenced this as a helpful addition though so if anyone from there has any suggestions for further improvements please do feel free to let me know 😄

A few notes: 1) No worries if this doesn't get merged I appreciate there's a spec and this will introduce backwards incompatible changes 1) If this were to be merged, would it make sense to only include the method in the methmap if it is explicitly enabled (but enable decompression regardless)? 1) I haven't adjusted the meth_cost value for Zstd which remains at 1.0 - I think this will create a bias to always select Zstd (which probably isn't right) - are the current figures based on a benchmark or just arbitrary? 1) I extended hts.c with some more information yanked from the Zstd header - I'm not entirely sure why I did that as I now realise that has nothing to do with the CRAM side of things. There are details regarding the decompressed size and a field that determines whether a checksum can be found at the end of the file though (which might be helpful for other non-CRAM files)

Re: #530

jkbonfield commented 5 months ago

Thanks for the suggestion. For now it'll probably languish, but it is something I want at some point.

  1. It's perfectly conceivable we could do something like this for CRAM 4 (and deprecate deflate as it's old hat), but it wouldn't be in one of the official CRAMs (up to 3.1).
  2. Correct - methmap is the way to get it explored for candidate encodings, and decoding will just follow the instructions written in the file regardless. We could be more careful and check specifically if it's legal for that CRAM version, but if it works then fair enough.
  3. The methmap figures are slightly arbitrary, but have some semblance of real data. We can't use the actual relative costs as if someone asked to enable lzma then the chance is they really don't mind it being used where appropriate even if it is costly. So the weightings have to be some balance between speed and size. They're skewed up and down by the -1 to -9 level too.
  4. There is benefit to having zstd understood by things like htsfile. Infact I think it may already support it in terms of recognising it as a file format.

As for whether it's useful. There are some cases where zstd shines. Specifically unaligned data or non-reference based encoding with very long sequences. Deflate's biggest weakness is the 32kb window limit for back references. When some sequences can be longer than 32kb you'll see this is a rather tragic limitation! This would make zstd a suitable alternative to lzma. It's not quite as small, but has a much broader range of compression levels and is usually considerably faster, particularly for decode. That said in normal CRAM it's not so helpful. Note you can use samtools cram-size to dump out information on the encoding types per CRAM data block and you'll see not much is using gzip/deflate really at CRAM 3.1.

Zstd however really should be used instead of deflate for the bgzf format, especially as it applies then to non-reference based sequences or very long VCF lines. I started work on this a year ago in https://github.com/jkbonfield/htslib/tree/bgzf2. It's mixture of seekable zstd with pzstd for parallel I/O. It was working, but I got pulled on to other projects and never found the time to dust it off again. If we finished this and added zstd as a dependency to htslib, it would probably also give me impetus to make zstd part of CRAM 4 too.

matty234 commented 5 months ago

In hindsight, a deeper understanding of Zstd would have been beneficial before I started :D

Seekable Zstd looks interesting - I noticed that it doesn't have the same header format as standard Zstd files so it might be worth adding to/adjusting https://github.com/samtools/htslib/blob/deeb9f01376ca9416315e4c9f5fe489e6f03e05f/hts.c#L583.

Thanks again for your review!