igvteam / igv

Integrative Genomics Viewer. Fast, efficient, scalable visualization tool for genomics data and annotations
https://igv.org
MIT License

Incorrect parsing of MM/ML tags with no "." or "?" #1545

Closed Shians closed 3 months ago

Shians commented 3 months ago

As of IGV 2.17.4 03/23/2024 (apologies if this is an old version; I don't have installation rights on this machine), BAM modification tags seem to be parsed incorrectly when the MM tag has no "." or "?" modifier. My interpretation of the spec is that a missing modifier is to be interpreted as ".", but IGV appears to interpret it as "?".
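To make the case concrete, here is a minimal sketch of splitting an MM tag value into its entries and reporting whether the "." or "?" modifier is present. The function name `parse_mm`, the dictionary keys, and the regular expression are illustrative assumptions based on my reading of the MM grammar in the SAM tags spec; this is not IGV's parser.

```python
import re

# One MM entry, e.g. "C+m?,5,12,0": base, strand, modification code(s),
# optional "." or "?" modifier, then comma-separated skip counts.
# (Pattern is an assumption based on the SAM tags spec grammar.)
_MM_ENTRY = re.compile(r"^([ACGTUN])([-+])([a-z]+|[0-9]+)([.?]?)((?:,[0-9]+)*)$")


def parse_mm(mm_value):
    """Parse an MM tag value; 'flag' is None when no '.' or '?' is given."""
    entries = []
    for chunk in mm_value.split(";"):
        if not chunk:
            continue  # the value ends with ';', so the last piece is empty
        m = _MM_ENTRY.match(chunk)
        if m is None:
            raise ValueError(f"malformed MM entry: {chunk!r}")
        base, strand, codes, flag, deltas = m.groups()
        entries.append({
            "base": base,
            "strand": strand,
            "codes": codes,
            "flag": flag or None,  # None marks the ambiguous case in this issue
            "deltas": [int(d) for d in deltas.split(",") if d],
        })
    return entries
```

For example, `parse_mm("C+m,5,12,0;")` yields one entry whose `flag` is `None`, which is exactly the situation where a reader has to choose between "." and "?".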

image

Below is a screenshot of a subset of data from https://github.com/human-pangenomics/HPP_Year1_Assemblies, which does not have a modifier (middle track). I've manually added "." to the BAM file shown in the top track and "?" to the one in the bottom track. The expectation is for the middle track to be identical to the top track, but it is instead identical to the bottom track.

image

The 3 bam files are attached and the data is found in the region chrX:72283997-72286054.

bam_files.zip

jrobinso commented 3 months ago

Strictly speaking you are correct, but the majority of extant files with no modifier are in fact intended to be "?". The modifier has been part of the spec for some time now. Can I ask what tool or pipeline is still producing files without this specified?

jrobinso commented 3 months ago

Also, do you know from details of the experiment that this file should in fact be interpreted as "."? It would be unusual.

Shians commented 3 months ago

Thanks for the very fast reply. This particular experiment definitely intended for the missing flag to indicate "?". However, as I am a maintainer of a tool that needs to parse such data, I am reluctant to code against the spec. I see you hit the same conundrum a few years ago in https://github.com/samtools/hts-specs/issues/654, and it doesn't seem like there's a satisfactory resolution.

I don't have any examples of real data where it should be treated as ".", and I hope to never see BAM files without the flag again. I will follow the precedent of IGV, as it sounds like that will most likely produce the result users expect.

jrobinso commented 3 months ago

It's not an ideal situation, but I think assuming "?" is safer than assuming ".". By assuming "?" you are not making any assumptions about the presence or absence of the modification. If you assume "." you are stating, in effect, that the modification is known to be absent. Since we know that many if not most files produced before this modifier was introduced did not intend to make statements about modifications not recorded, I think we have to go with the "don't know" option. No current tools should be producing files without these modifiers.
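The practical difference between the two assumptions can be sketched as follows. The function `base_status`, its labels, and the `default_flag` parameter are hypothetical names for illustration; the sketch assumes every unlisted base is treated the same way and ignores strand orientation, so it is a simplification rather than a description of what IGV does internally.

```python
def base_status(deltas, n_bases, flag=None, default_flag="?"):
    """Label each of n_bases candidate bases for one MM entry.

    deltas       -- skip counts from the MM entry (bases skipped before each call)
    flag         -- "." or "?" from the tag, or None when the tag omits it
    default_flag -- policy for a missing flag: "?" as argued above, "." per the spec
    """
    mode = flag if flag is not None else default_flag
    # "." asserts skipped bases are unmodified; "?" makes no statement about them.
    skipped = "unmodified" if mode == "." else "unknown"
    status = [skipped] * n_bases
    pos = -1
    for d in deltas:
        pos += d + 1  # skip d bases, then land on the base that carries a call
        if pos >= n_bases:
            raise ValueError("delta list runs past the available bases")
        status[pos] = "called"
    return status
```

With `deltas=[1, 0]` and four candidate bases, the default policy leaves the first and last base as `"unknown"`, while `default_flag="."` marks them `"unmodified"`. Keeping the fallback as a parameter lets a tool follow the lenient behaviour by default and still offer strict spec behaviour when the user knows the data warrants it.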