samtools / htslib

C library for high-throughput sequencing data formats

Base mod API improvements #1636

Closed jkbonfield closed 1 year ago

jkbonfield commented 1 year ago

Several improvements.

The purpose of the HTS_MOD_REPORT_UNCHECKED flag is to let us distinguish three states: modified, unmodified, and unchecked.

Imagine processing a pileup where we're trying to gather statistics on what percentage of the data in the pileup is modified. If all sequences use implicit mode, where every base is checked, the likely modifications are recorded in MM, and the unrecorded bases are (implicitly) deemed unmodified, then the statistics are a simple binary yes/no.

If we only have explicit-mode data present, where the mod-caller checked only some positions, the call's qual value is used to distinguish between likely modified and unlikely modified, and uncalled positions have not been checked, then we need to count differently. We gather statistics only on the called bases and make a binary split on qual instead. Uncalled bases (those not in the MM tag) carry no information on modification status, so they are ignored.

However, if we have a mix of the two, we get in a pickle. We have to start using the bam_mods_queryi / bam_mods_query_type API for each call list to determine whether the absence of a mod at a particular site means no-mod-detected or did-not-look.

With the HTS_MOD_REPORT_UNCHECKED flag set, explicit-mode data produces a modification call even at the sites that aren't listed, but with a specific qual of HTS_MOD_UNCHECKED. This means code can handle implicit, explicit, and mixed data with a single loop without needing additional queries. No mod or low qual => not found. Mod with high qual => found. Mod with unchecked qual => ignore for stats.
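
As a rough illustration of that single loop, here is a minimal standalone sketch of the three-way decision. It deliberately does not call htslib: `mod_call`, `mod_stats`, `classify_call`, `tally_column` and `modified_fraction` are hypothetical names, `MOD_QUAL_UNCHECKED` is only a stand-in for HTS_MOD_UNCHECKED (its numeric value here is an assumption), and the threshold of 128 is an arbitrary choice for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for htslib's HTS_MOD_UNCHECKED; the value is an assumption. */
#define MOD_QUAL_UNCHECKED (-2)

/* Hypothetical cut-off between "likely modified" and "unlikely modified". */
#define MOD_QUAL_THRESHOLD 128

/* One base in one read, as the pileup loop would see it (hypothetical type). */
typedef struct {
    int has_call; /* 1 if a modification call was reported at this base */
    int qual;     /* call quality, or MOD_QUAL_UNCHECKED */
} mod_call;

typedef struct {
    int modified;
    int unmodified;
    int unchecked;
} mod_stats;

typedef enum { SITE_UNMODIFIED, SITE_MODIFIED, SITE_UNCHECKED } site_state;

/* No mod or low qual => not found; high qual => found;
 * unchecked qual => the caller never looked at this site. */
site_state classify_call(mod_call c) {
    if (!c.has_call)
        return SITE_UNMODIFIED;
    if (c.qual == MOD_QUAL_UNCHECKED)
        return SITE_UNCHECKED;
    return c.qual >= MOD_QUAL_THRESHOLD ? SITE_MODIFIED : SITE_UNMODIFIED;
}

/* Accumulate one pileup column; works the same for implicit,
 * explicit and mixed data because the state is carried per call. */
void tally_column(const mod_call *calls, size_t n, mod_stats *stats) {
    assert(stats != NULL);
    for (size_t i = 0; i < n; i++) {
        switch (classify_call(calls[i])) {
        case SITE_MODIFIED:   stats->modified++;   break;
        case SITE_UNMODIFIED: stats->unmodified++; break;
        case SITE_UNCHECKED:  stats->unchecked++;  break;
        }
    }
}

/* Fraction modified among the bases that were actually checked. */
double modified_fraction(const mod_stats *s) {
    int checked = s->modified + s->unmodified;
    return checked ? (double)s->modified / checked : 0.0;
}
```

In real code the `mod_call` entries would be filled from whatever the base-mod iterator reports for each read at the current position; unchecked sites are counted separately so they never enter the denominator.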

Fixes #1550 hopefully!