MD tag generation incorrect?

semenko commented 1 year ago

I ran into an odd issue parsing biscuit alignments with pysam (https://github.com/pysam-developers/pysam/issues/1180) — and I wonder if the MD tags generated by biscuit are correctly inserting a 0 after deleted bases followed by mismatches.

This is described a bit in this comment: https://github.com/pysam-developers/pysam/issues/895#issuecomment-603327971

Specifically I'm running into this issue looking at this tiny bam and read A00354:758:HVTT7DSX3:3:2333:29405:16219 with MD tag 18A5A5A1C3A13^C32^CCTAA10A1C1C32

jamorrison commented 1 year ago

According to this comment in the hts-spec issue seeking to clarify the MD tag, BWA (which BISCUIT is based on) should adhere to the spec for formatting MD tags (i.e., adding 0's after deletions that are followed by mismatches). I also looked at the reads with deletions in them in your BAM and the size of the deletions match up with the number of bases printed in the MD tag. So, I don't think the deletions are the cause of the error you're seeing.

That said, if the pysam issue discussion turns up the BISCUIT MD tag is malformed, let me know and I'll figure out a fix.

jmarshall commented 1 year ago

Please see the second half of https://github.com/pysam-developers/pysam/issues/1180#issuecomment-1498186190.

It would appear that the MD values emitted by biscuit, while they are syntactically correct w.r.t. the ‘0’ separators, are not quite correct. (Or perhaps that there are ambiguity characters in the reference at these positions, which biscuit is handling in some way?)

jamorrison commented 1 year ago

Comparing the actual reference to the reference inferred by pysam in https://github.com/pysam-developers/pysam/issues/1180#issuecomment-1498186190, it looks like the difference is due to the bisulfite conversion.

   ||| (and more down the string)
   vvv
TAACCCTAACCCTACCCTAACCCTAACCCTAAC (reference)
TAATTTTAATTTTATTTTaATTTTaATTTTaAc (inferred from MD)

And in looking through the code, it appears that BISCUIT does not properly account for the C>T / G>A conversion for the MD tag.

I'll work on getting a fix put together for this.

EDIT: I previously mentioned that it seemed like the NM tag was "handled properly." This is partially correct, as described below.

jamorrison commented 1 year ago

I made a commit that now brings the BISCUIT MD tag into alignment with the hts-spec (i.e., the MD tag is the same in BISCUIT as samtools calmd). This is available on the master branch now and will be in the next release of BISCUIT. Note, that for GC-rich regions, the MD tags can be quite long (potentially up to twice as long as SEQ).

As for the NM tag, the decision was made to leave this as the number of non-cytosine-conversion mismatches, which falls under the gray area of the hts-spec (see the italicized section below):

NM:i:count

Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference, counting only (case-insensitive) A, C, G and T bases in sequence and reference as potential matches, with everything else being a mismatch. Note this means that ambiguity codes in both sequence and reference that match each other, such as N in both, or compatible codes such as A and R, are still counted as mismatches. The special sequence base = will always be considered to be a match, even if the reference is ambiguous at that point. Alignment reference skips, padding, soft and hard clipping (N, P, S and H CIGAR operations) do not count as mismatches, but insertions and deletions count as one mismatch per base.

Note that historically this has been ill-defined and both data and tools exist that disagree with this definition. (emphasis added)

The NM tag that would be given by samtools calmd can be recreated by adding the NM tag value and the ZC tag value. I will update the documentation to clarify this.

huishenlab / biscuit

MD tag generation incorrect? #37