Open alshai opened 8 months ago
I'm not sure I have a full explanation for why this is happening, but one thing that I would recommend is dropping the base qualities on the Ns. If an N comes off of a sequencing machine, the base is usually given a base quality of #
, whereas you've given them a near-perfect base quality. If this still occurs with more realistic base qualities, it would also help to have the full alignments to look at (the path
field in the JSON).
With one N, you still get full-length alignments from gapless extension, because the full-length bonus is larger than the penalty for a single mismatch. With two or more Ns, the trailing Ns get trimmed, and Giraffe does tail alignment with Dozeu. What does Dozeu do when the read and/or the graph contain Ns? (Gapless extension masks all non-ACGT characters in the read with X, assuming that the graph does not contain character X.)
I believe that Dozeu will treat all Ns as 0 score mismatches (even if both read and reference are N). The full length bonus is not applied during dynamic programming, but it is applied when choosing the optimal traceback location and in the final score.
I think I know what is happening. Let's concentrate on the read with two trailing Ns.
There are three equally good alignments in the graph, with the problematic one being path >45>47>48>49>50
in the following subgraph:
H VN:Z:1.1 RS:Z:GRCh38
S 44686345 AGGATCGCTTGAGGCCAGGAGGCGGATGTTGCAGTGAGCCATGATCAT
S 44686346 A
S 44686347 G
S 44686348 ACACTGTA
S 44686349 C
S 44686350 TCCAGCCTGGGTGACAGAGCGAAATTCCAGCAAAGAAAGGGAGAGAGGAAGGGAGAGAGGGAGTAGGGGAGAGAGAGAGAGAGAG
L 44686345 + 44686346 + 0M
L 44686345 + 44686347 + 0M
L 44686346 + 44686348 + 0M
L 44686347 + 44686348 + 0M
L 44686348 + 44686349 + 0M
L 44686349 + 44686350 + 0M
W GRCh38 0 chr5 176090011 176090154 >44686345>44686346>44686348>44686349>44686350 WT:i:51
W unknown 1 chr5 0 57 >44686345>44686347>44686348 WT:i:1
W unknown 2 chr5 0 143 >44686345>44686347>44686348>44686349>44686350 WT:i:36
W unknown 3 chr5 0 85 >44686350 WT:i:1
The first (internal) N corresponds to node 49. Haplotype unknown 1
ends at node 48, because it would have a rare allele at the position of node 49. Meanwhile, haplotype unknown 2
would be the correct alignment. We get one right-maximal extension with no mismatches over unknown 1
and another with 10 additional matches, 4 additional mismatches (all Ns), and a full-length bonus over unknown 2
. Because the extension over unknown 1
scores higher, we select it as the best extension.
The best extension has an unaligned 14 bp tail, which we align using Dozeu. Because Dozeu does not use the mismatch penalty for Ns, this alignment has a higher score than the equivalent alignments elsewhere in the graph.
A couple of questions:
Do the reads automatically get trimmed if there are at least 2 Ns. Or is it that in this example the 1-N read just happens to be small enough to get the full-length bonus?
Do the Ns receive a penalty during the right-maximal extension?
Is there anyway we can tell Dozeu to use a non-0 penalty for Ns?
Giraffe uses two alignment algorithms. First it tries to find an alignment without indels. If it can't find a full-length alignment with a few mismatches, it trims the partial alignments it found, as mismatches near the ends of an alignment often indicate the presence of indels. Then it aligns the tails of the trimmed alignment while allowing indels.
In the first (gapless extension) algorithm, Ns are mismatches with a normal mismatch penalty against everything. In the second algorithm (Dozeu), Ns are zero-cost mismatches against everything. This is effectively a rare bug that requires a complex fix, because the above behaviors are inherent to the way the alignment algorithms have been implemented.
1. What were you trying to do?
Doing some experiments where I am aligning the following reads to the HPRC v1.1 pangenome index:
r1_wo_n
r1_cut
r1_w_1n
r1_w_2n
r1_w_n
Here's the FASTQ file:
2. What did you want to happen?
I expected the the reads with tails of Ns to have a similar alignment score and mapping quality to the trimmed read. And also expected the reads with Ns to have worse mapping quality than the full read.
3. What actually happened?
Reads with at least 2 Ns have a better alignment score than the trimmed read, and subsequently a better mapping quality than both the trimmed read and the full read. Oddly, the read with 2 Ns (
r1_w_2n
) has an even higher mapping quality than the read with the full amount of Ns (r1_w_n
). Here's a filtered version of the GAM output. It's clear that the higher mapping quality results from the reads with Ns having higher-than-expected primary alignment scores.5. What data and command can the vg dev team use to make the problem happen?
I used a re-tagged version of HPRC v1.1 (specified GRCh38 as the only reference). Hopefully this should have the same behavior as the original HPRC v1.1 index. Also enabled reporting of secondary alignments.
6. What does running
vg version
say?