samtools / htslib

C library for high-throughput sequencing data formats
Other
809 stars 446 forks source link

Expand CRAM API a bit to cope with new samtools cram_size command. #1546

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago
jkbonfield commented 1 year ago

I've made a public version of enum cram_block_method, avoiding the problems we had before with shortened enums that could clash. This matches the CRAM specs and doesn't expand upon it to describe any of the specialisations of the codecs.

I'll think on this, but will probably end up with something in htscodecs and a returned structure as you suggest. I may even add the short and long string form as elements to that struct too, or write dedicated functions. Undecided atm.

jkbonfield commented 1 year ago

Well, following advice I've moved code from the samtools cram_size code into where it fits best (htscodecs rather than htslib given it's data codec querying rather than specifically CRAM), but it's messy! Specifically samtools has no way of using htscodecs directly, so it'll require a whole raft of build system hacks. Sigh. (For now I have it copied in there so I can check if it compiles and passes CI, but it will be fixed.)

I don't really want to bless the cram_size code as a canonical enum for an expanded list of codecs numbers, because it's rather arbitrary and possibly subject to change. Putting it in htslib or htscodecs means we can't change it, unless we end up copying the whole of cram_size there and make it just fill out a kstring, but that's also wrong as we're punting one-use application code to the library.

I'm starting to think I should have left it all as it was.

jkbonfield commented 1 year ago

I resolved it by backing out the changes to htscodecs and punted them to htslib instead. Not ideal as it doesn't quite feel right, but it makes things vastly simpler for samtools and I don't think the minimal difference justifies the extra complexity with putting those functions in htscodecs.