Closed jkbonfield closed 1 year ago
I've made a public version of enum cram_block_method
, avoiding the problems we had before with shortened enums that could clash. This matches the CRAM specs and doesn't expand upon it to describe any of the specialisations of the codecs.
I'll think on this, but will probably end up with something in htscodecs and a returned structure as you suggest. I may even add the short and long string form as elements to that struct too, or write dedicated functions. Undecided atm.
Well, following advice I've moved code from the samtools cram_size code into where it fits best (htscodecs rather than htslib given it's data codec querying rather than specifically CRAM), but it's messy! Specifically samtools has no way of using htscodecs directly, so it'll require a whole raft of build system hacks. Sigh. (For now I have it copied in there so I can check if it compiles and passes CI, but it will be fixed.)
I don't really want to bless the cram_size code as a canonical enum for an expanded list of codecs numbers, because it's rather arbitrary and possibly subject to change. Putting it in htslib or htscodecs means we can't change it, unless we end up copying the whole of cram_size there and make it just fill out a kstring, but that's also wrong as we're punting one-use application code to the library.
I'm starting to think I should have left it all as it was.
I resolved it by backing out the changes to htscodecs and punted them to htslib instead. Not ideal as it doesn't quite feel right, but it makes things vastly simpler for samtools and I don't think the minimal difference justifies the extra complexity with putting those functions in htscodecs.
cram_block_get_content_id is updated to return -1 for CORE block, removing the need to add a content_type API and share that enum.
New cram_block_get_method method
New cram_codec_get_content_ids, which externalises the existing cram_codec_to_id (with a hopefully better API and name).
Make cram_decode_compression_header and cram_free_compression_header external. The struct itself is opaque.
Added cram_update_cid2ds_map, cram_cid2ds_query and cram_cid2ds_free to produce, query and free a block Content-ID to Data-Series mapping from the meta-data held within a CRAM Compression Header.
Added a new cram_codec iterator to sequentially step through all cram_codec structures produced by decoding a compression header. This new code is entirely internal to htslib and is solely used by the new cram_update_cid2ds_map function.
Add a codec->describe method. This produces a text description of codec parameters, such as
EXTERNAL(id=1) HUFFMAN(codes={5,1,3},lengths={1,2,2}) BYTE_ARRAY_LEN(len_codec={EXTERNAL(id=42)},val_codec={EXTERNAL(id=37)}
Adds the external API: int cram_codec_describe(cram_codec c, kstring_t ks) TODO: Some of the proposed CRAM 4.0 codecs don't yet have descriptions, but these aren't official codecs anyway.
Add a function to describe all encodings in a compression header block: int cram_describe_encodings(cram_block_compression_hdr hdr, kstring_t ks);
Add cram_container_get_num_records and cram_container_get_num_bases functions. These are two new external APIs for querying sequence meta-data from containers, which in theory could be used for a (non-filtering) "view -c" cmd.