Create base_mod.c from sam.c. Sam.c was the largest file in the main htslib dir, and splitting out the base modification code means sam_mod.c is 21st of 36 files, so it's plenty big enough to justify it. Headers haven't changed, with API in htslib/sam.h as before.
Bug fix the MM parsing code. It wasn't reverting back implicit, so gave incorrect results on explicit vs implicit on mixed data types within a single alignment.
Added a bam_mods_queryi API to go along side the existing bam_mods_query_type one. It returns data on the i^th modification type instead of looking up by 'code'.
Added bam_parse_basemod2 API with an additional flag field. The only flag at present is HTS_MOD_REPORT_UNCHECKED
Modified bam_next_basemod and bam_mods_at_next_pos (and implicitly bam_mods_at_qpos) to honour the new flag.
Added some internal documentation to sam_mods.c. Long term this may be best going somewhere else, but it could do with more tidying up still.
The purpose of the flag is to allow us to distinguish the three states of modified, unmodified, and unchecked.
Imagine a scenario of a processing a pileup where we're trying to gather statistics on what percentage of data in the pileup is modified. If all sequences are using implicit mode where every base is checked and we record the likely modifications in MM and (implicitly) the unrecorded bases have been deemed to be unmodified, then the statistics are a simple binary yes/no.
If we only have explicit-mode data present, where the mod-caller checked some places only and the QUAL field is used to distinguish between likely modified and unlikely modified, and uncalled positions have not been checked, then we need to count differently. We gather statistics only on the called bases and make a binary split on QUAL instead. Uncalled bases (those not in the MM tag) have no information on modification status so are ignored.
However if we have a mix of the two, we get in a pickle. We have to start using the bam_mods_queryi / bam_mods_query_type API for each call list to determine whether the absence of a mod at a particular site means no-mod-detected or did-not-look.
With the HTS_MOD_REPORT_UNCHECKED flag set, the explicit-mode will produce a modification call even on the sites that aren't listed, but with a specific qual of HTS_MOD_UNCHECKED. This means code can handle implicit, explicit and mixed data with a single loop without needing to do additional queries. No mod or low qual = not found. Mod with high qual=found. Mod witth unchecked qual => ignore for stats.
Several improvements.
implicit
, so gave incorrect results on explicit vs implicit on mixed data types within a single alignment.bam_mods_queryi
API to go along side the existingbam_mods_query_type
one. It returns data on the i^th modification type instead of looking up by 'code'.bam_parse_basemod2
API with an additional flag field. The only flag at present isHTS_MOD_REPORT_UNCHECKED
bam_next_basemod
andbam_mods_at_next_pos
(and implicitlybam_mods_at_qpos
) to honour the new flag.The purpose of the flag is to allow us to distinguish the three states of modified, unmodified, and unchecked.
Imagine a scenario of a processing a pileup where we're trying to gather statistics on what percentage of data in the pileup is modified. If all sequences are using implicit mode where every base is checked and we record the likely modifications in MM and (implicitly) the unrecorded bases have been deemed to be unmodified, then the statistics are a simple binary yes/no.
If we only have explicit-mode data present, where the mod-caller checked some places only and the QUAL field is used to distinguish between likely modified and unlikely modified, and uncalled positions have not been checked, then we need to count differently. We gather statistics only on the called bases and make a binary split on QUAL instead. Uncalled bases (those not in the MM tag) have no information on modification status so are ignored.
However if we have a mix of the two, we get in a pickle. We have to start using the
bam_mods_queryi
/bam_mods_query_type
API for each call list to determine whether the absence of a mod at a particular site means no-mod-detected or did-not-look.With the
HTS_MOD_REPORT_UNCHECKED
flag set, the explicit-mode will produce a modification call even on the sites that aren't listed, but with a specific qual ofHTS_MOD_UNCHECKED
. This means code can handle implicit, explicit and mixed data with a single loop without needing to do additional queries. No mod or low qual = not found. Mod with high qual=found. Mod witth unchecked qual => ignore for stats.Fixes #1550 hopefully!