ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
142 stars 17 forks source link

`NM` tag might be not calculated correctly in some situation #294

Closed y9c closed 1 year ago

y9c commented 1 year ago

I noticed that strobealign might not calculate NM tags correctly for some special cases. I generated an example as follow:

The reference file:

>ref
CTCCAAAGATGTGATCGAGTGCTTGGTCGAGTGATGGTATTGGTGGTGAGTGATGATACTCGACCAGGGGGTCGAGTAATATGGTGGAG

The query sequence:

>qry
CTAGGTATTGGTGGTGGTCTT

And tested it use this command:

trobealign -k 8 -A 4 -O 3 -L 100 ref.fa debug.fa 2>/dev/null | samtools calmd - ./ref.fa | grep different

[bam_fillmd1] different NM for read 'qry': 17 -> 18

The error message from samtools indicates that the NM tag is not correct, at least not in a stadard way.

I count the mismatch manually, and the number is 18 rather than 17

CTA-GGTATTGGTGGTG-GTCTT
.|...|...|.|..|........
GTGATGGTATTGGTGGTGAGTGA

| for match and . for mismatch

ksahlin commented 1 year ago

I ran the above command and reproduced what you got:

@HD VN:1.6  SO:unsorted
@SQ SN:ref  LN:89
@PG ID:strobealign  PN:strobealign  VN:v0.9.0-63-g77e36e1   CL:./build/strobealign -k 8 -A 4 -O 3 -L 100 /Users/ksahlin/tmp/strobealign_bugfixes/ref.fa /Users/ksahlin/tmp/strobealign_bugfixes/query.fa 
Running in single-end mode
qry 0   ref 31  0   3M1D13M1D5M *   0   0   CTAGGTATTGGTGGTGGTCTT       NM:i:17 AS:i:239

So this is a bug from what I can see.

Besides the bug of 17 -> 18: I must say I am a bit confused by strobealign's chosen alignment since the ref and query both contain the k-mer GGTATTGGTGGTG which I assume would be matched in the best alignment, even with the customised alignment parameters. This shared 13nt segment is what finds a shared seed in the first place. If I only run with the parameters -k 8 -A 4 -L 100 (omitting -O) we get 3X13=5X which looks less degenerate. I cannot understand at the moment how lowering gap opening from 12 to 3 would cause this. In fact, I tried all -O from 1 to 12 and degenerate alignment happen with -O from 1 to 3, while an -O set between 4 to 12 finds the shared 13-mer.

@marcelm, care to look deeper into this?

marcelm commented 1 year ago

I can also reproduce the problem. It appears that the CIGAR string is incorrect: When running strobealign with --eqx, the CIGAR string that is output is 1X1=1X1D1X1=3X1=1X1=2X1=2X1D4X1=. The last operation is a match (=), but that is not correct since A is aligned with T, so the operation should be an X. The value of the NM tag is derived from the CIGAR, which explains the discrepancy.

I’ve checked which seed is used, and that is indeed GGTATTGGTGG, so the alignment seems to be computed incorrectly.

marcelm commented 1 year ago

This turns out to be a bug in SSW that has been fixed already upstream, but we were still using the old version in strobealign, see #296. With the fix, the following alignment is found:

GTGATGGTATTGGTGGTGAGTGAT ref 
X=D=D=============D==XX=
CT A-GGTATTGGTGGTG-GTCTT query

CIGAR string is 1X1=1D1=1D13=1D2=2X1= and NM is 6.

Thanks a lot for reporting!

y9c commented 1 year ago

Thank you very much for the quick fix.