ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
141 stars 17 forks source link

Imperfect read alignment, changing parameters does not help #357

Open tprodanov opened 11 months ago

tprodanov commented 11 months ago

Mapping a read

>A00297:159:HV2LVDSXX:3:1439:25572:32503
AGGCGGCCCCTGCAGTGGTCGCTGATGCAGGCGTTGAAGAATGGGCCCGGGGGCACAAGGTTGTGGCACTCAGCAAAGACCCTGTGGGGCAAAGCGATCAGGACGGAGGGCCCCCAGCGTGCATGCGGGCGGGCGGGGGCGCCCCAGGGC

to the reference

>ref
CCCATGCCCACCACAGCCGCTCTGTGATCTGATGCTGAGCCAGTGAGTCCTCCCCTCGGG
GGTTGCAGGCCCTGGGGCGCCCCCGCCCGCCCGCATGCACGCTGGGGGCCCTCCGTCCTG
ATCGCTTTGCCCCACAGGGTCTTTGCTGAGTGCCACAACCTTGTGCCCCCGGGCCCATTC
TTCAACGCCTGCATCAGCGACCACTGCAGGGCCGCCTTGAGGTGCCCTGCCAGAGCCTGA
GGCTTACGCAGAGCTCTGCCGCGCCCGGGGAGTGTGCAGTGACTGGCGAGGTGCAACCGG
TGGCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGACCCCTCGGCCTCTCTGAGT
GTCCTGTGCGGTGAGTGGGGGCGGCCCCGGGCCCCCCAGAC

using Strobealign v0.11.0 produces alignment 143=7S with score 296. Mapping the same read using BWA MEM finds a better alignment 140=1I9= with score 143 (x2 = 286). Setting high penalty for soft clipping -L 1000 produces a bad alignment 143=1X1=2X1=1X1= with score 2260, for some reason. Decreasing gap opening (-O) penalty and increasing match bonus/mismatch penalty (-A/B) did not help. I tried changing seeding arguments -m/k/c/u/l, but with no luck for now.

Maybe part of the strobemer aligns in place of the 1 bp insert and this breaks the rest of the alignment, but it is strange that alignment cannot be fixed by modifying seeding arguments. Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

ksahlin commented 11 months ago

Excellent bug report, I can reproduce your results.

I will discuss with @marcelm. Thanks for letting us know!

marcelm commented 11 months ago

Also, extremely high score with high soft clipping penalty without the actual soft clipping seems like a bug.

The soft clipping penalty is not actually a penalty, but an "end bonus". It is added to the score for each end that reaches the end. This is also why you get the score 2260 when you set -L 1000.

(I know this is not documented correctly, and we need to fix this. – And maybe the better fix would be to make it actually be a soft-clipping penalty.)

I believe the issue may be that the alignment library we are using (SSW) does not directly support soft clipping penalties. We just let it find an alignment that is allowed to start and end anywhere within the read, and then afterwards try to extend it towards the end of the reads and adjust the score accordingly.

When we run SSW, we get 143=7S with a score of 286 (143 times match score). The end bonus for the 5' end is then added afterwards and gives a total score of 296.

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches). So although the total score would be 305 because we can add the end bonus twice, that variant is never considered.

ksahlin commented 11 months ago

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches).

Shouldn’t lowering -O adress the issue then, given your explanation? (I tried lowering -O and it still produces 143=7S).

(mentioned in anther discussion: I haven’t thought it through properly yet, but is it sound that the end bonus/penalty is part in determining the best location for PE-reads?)

drtconway commented 10 months ago

Reflecting some direct emails, there seem to be some examples where the alignments have slightly odd indel behaviour, where bwa is happy to soft clip.

I made a recipe for reproducing some cases at https://gist.github.com/drtconway/962697804acd1d89c753b1d9a0378fc6

marcelm commented 10 months ago

I believe to have found the problem.

First, a correction to what I wrote:

SSW does not find 140=1I9= because it would result in a score of 280 - 12 - 1 + 18 = 285 (140 matches, 1 gap open, 1 gap extend, 9 matches)

Unlike BWA, our gap costs are computed as gap_open + (length - 1) * gap_extend, so the single insertion costs 12, not 12 + 1, making the total score 286. My argument would still hold (both alignments are equally good, so SSW can arbitrarily decide which one to pick), but the actual issue is a different one.

The actual problem is that SSW is never called due to an optimization in strobealign that skips computing a gapped alignment if the ungapped alignment looks "good enough". This is a part of the relevant code:

https://github.com/ksahlin/strobealign/blob/7fe293414e6ded849d654ba4758b415be26ffc93/src/aln.cpp#L226-L229

To reduce runtime, strobealign first computes an ungapped alignment when extending a seed. At this stage, soft clipping is not allowed and no CIGAR string is computed, we only count the number of mismatches. If that is below 5% of the length of the query, we skip computing a full gapped alignment with SSW.

Here, the ungapped alignment is 143= 1X 1= 2X 1= 1X 1= with four mismatches. Since 4/150=0.03, we skip SSW.

When SSW has been skipped, the ungapped alignment is re-done to get the CIGAR, this time taking soft-clipping into account. That is why the final output is 143=7S.

I was able to get the 140=1I9= alignment by 1) disabling the optimization and 2) reducing gap open penalty to 11 (-O 11).

ksahlin commented 10 months ago

Perfect that you could pin this down. Easy to fix, but requires a bit of though how to do this without introducing too much runtime.

Perhaps hamming only over the span of the NAMs, but it may lead to a lot of extra ksw calls. We can discuss this offline unless you already have something in mind.

marcelm commented 10 months ago

(I wrote the above yesterday, I notice only now I haven’t actually posted it.)

It’s clear that just disabling the optimization is undesirable. On a test dataset (drosophila-50), runtime increases by more than 50%.

I think it makes sense to first ensure that the initial ungapped alignment is also done with soft clipping. The only reason for the current, inconsistent behavior is that I forgot this when I added soft clipping.

With that in place, we would get 143=7S as the first ungapped alignment in the example. If we count the soft-clipped bases as mismatches, that would give us 7/150 = 4.7%. That is still below the currently allowed threshold of 5%, so we would still skip SSW, but it’s a bit closer. Lowering the threshold to 4% (for example) is IMO not the way to go as it doesn’t help for longer reads.

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

ksahlin commented 10 months ago

I wonder whether we should then just run ksw on the ends that the ungapped alignment soft-clips. ksw can do extension alignment (unlike SSW) and because the soft-clipped ends are typically short, it should be quite fast.

Sounds good. So we need to start around _k_nt in w.r.t. the softclipp. So for the above example we would do extension alignment with the k+7 rightmost nt in the read to detect the indel.

ksahlin commented 10 months ago

Btw, what does strobealign currently do when the region with the NAM has the same length on the query and the reference and Hamming distance of the NAM region is high?

Do we fully realign such cases with SSW? If so, an optimization would be to run ksw on the ends only. I remember we have discussed similar scenarios when we tried out partitioning the alignments and use WFA2 but I don't remember the conclusions.

Clarification: I meant when Hamming distance is high -- possibly because regions outside the NAM region do not fit (e.g. indels). Then it might be inefficient to realign the whole read. One approach would be to try hamming of the NAM hit only, then extension of the ends.