samtools / htslib

C library for high-throughput sequencing data formats
Other
789 stars 447 forks source link

Demo #1589

Closed vasudeva8 closed 1 year ago

vasudeva8 commented 1 year ago

This pull request adds a number of sample programs to demonstrate the usage of HTSLib apis. https://docs.google.com/document/d/1d6EECYkezwZWW4zWpyqGNHK4LIrcai8anJAvNbtxiVo/edit# gives a documentation of the same.

jkbonfield commented 1 year ago

I still have specifics to go through with the demos, but I decided to take a more high level overview of things. I should have started with this before, sorry.

Basically we have a few categories of tools, and their closest matching samtools equivalent where possible:

So we have a lot of tools, but not too many themes. That's good as it permits a better narrative in a tutorial or sets of related worked examples. Some of the APIs though I think are in the wrong place.

Flagstat is a read-only tool, which is good for demonstrating decoding of the basic sequence API. This may mean read_bam is irrelevant. Or perhaps read_bam belongs in the same chapter / topic as the flagstat one. Flagstat is a perfect example of the basic hts_set_opt API with HTS_OPT_NTHREADS and CRAM_OPT_REQUIRED_FIELDS. It's not appropriate for the thread pool, as we only have one thing to do.

The read/write loops could may be considered as a file format conversion tool, so we can introduce filename handling and setting format based on suffixes. All tools currently just use sam_open(..., "w") to write SAM, so we're missing useful functionality here. A read/write loop is the perfect place to bring in using a shared thread pool, instead of explicitly controlling a per-file-descriptor thread count.

Pileup is its own thing, and definitely needs its own tutorial section. I'm happy with pileup vs mpileup, but the constructor/destructor interface is an extra complexity that has no purpose in the examples. We probably need something more advanced, such as optimising away an overhead from being computed per-column to per-read. There are choices there too - caching numerics in pileup1_t->cd.i vs caching complex things needing memory allocation, which is where the constructor/destructor come in. I'd need to think of a concise example, but test/pileup_mod.c is one such use where bam_parse_basemod is called during the constructor.

The index querying can come in at any point I guess, as it applies equallty to both a read-only scenario and a read-write scenario. I'm not sure which topic is the best one to add it to, but it's OK in the read/write code.

The header API feels a bit disjoint, as well as split.c which I can't follow the intention of. Maybe these two can be merged into the same overall topic. Eg split by read-group requires processing of the header, and maybe we'd want to show how we can add new PG entries or remove other RG lines when splitting. Again this is a perfect example of where the thread pool can be useful as we're writing out several files at once.

Aux tag manipulation could do with extending. Right now it's just regurgitating the aux tag data, which isn't so helpful (and also not the ideal way to do it). Maybe we need something to summarise tag utilisation (tagstats), or to add a new tag computed from the read data. Eg adding a cg:f tag to report the CG %age for each read. Maybe also for fun gather statistics on CG content vs MQUAL. Infact the latter half of that could be a read-BAM-only while the former is a read-write loop, so perhaps it's better than the flagstat and view alternatives? Thoughts?