ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Use ksw2 to align soft-clipped ends from ungapped alignment #382

Open marcelm opened 6 months ago

marcelm commented 6 months ago

This should fix some unexpected alignments as reported in #357.

When extending a seed, we currently compute an ungapped alignment allowing softclipping and report this unchanged if the hamming distance (counting each soft-clipped base as a mismatch) is low enough.

This is problematic because there might exist a gapped alignment of the soft-clipped ends that results in an overall higher score.

This PR changes the behavior so that after computing the ungapped alignment, ksw2 is used to align each end that has been soft clipped.

I changed two reads in the phiX test dataset so that they would be soft-clipped with the old code but now result in a non-soft-clipped alignment with a single deletion, similar to the problematic read in #357.

Accuracy changes minimally (+0.006); mapping rate is unaffected. Runtime is hard to measure, maybe it’s 1-2% slower.

The addition of ksw_extend also enables addressing #377, but this is left for a separate PR.

Closes #357

CC @tprodanov

ksahlin commented 6 months ago

Excellent! Since it was a while ago I did a benchmark, I decided to run f4a8dd2 (named _kws2) against v0.12.0 (commit 6fd4c5d).

accuracy_plot.pdf time_plot.pdf memory_plot.pdf

I have also started a benchmark on biological data with SNP/indel calling, which may be more relevant. Will report back here.

ksahlin commented 6 months ago

Results for the variant analysis below.

I have a hard time forming a consensus, but overall I think it looks good. I guess you verified it also had intended behaviour in the bug reported.

I approve a merge.

Screenshot 2023-12-18 at 16 45 55

Screenshot 2023-12-18 at 16 42 35

ksahlin commented 6 months ago

Adding the time and memory usage here as well b/t v0.7.1, v0.12.0 and this branch for reference. Runtime looks mostly unchanged between v0.12.0 and this branch except for SIM150.

It's one run per experiment though, but the node I run on is cleared from any other interfering processes.

Screenshot 2023-12-18 at 21 22 27

marcelm commented 6 months ago

I had to type this off from the screenshot, so I may have made a mistake, but are you sure this PR is better than 0.12.0?

what BWA-MEM strobealign 0.12 extend-seed diff
recall SIM150 SNV 97.9 97.6 97.6 0
recall SIM150 INDEL 62.2 67.3 67.0 -0.3
recall SIM250 SNV 98.9 98.6 98.6 0
recall SIM250 INDEL 68.4 68.5 68.5 0
recall BIO150 SNV 98.9 98.4 98.4 0
recall BIO150 INDEL 90.1 89.7 90.2 +0.5
recall BIO250 SNV 97.7 97.5 97.5 0
recall BIO250 INDEL 84.1 83.0 84.1 +1.1
precision SIM150 SNV 99.6 99.4 99.4 0
precision SIM150 INDEL 72.6 72.7 72.2 -0.5
precision SIM250 SNV 99.5 99.3 99.3 0
precision SIM250 INDEL 72.8 73.0 72.8 -0.2
precision BIO150 SNV 80.5 82.5 82.5 0
precision BIO150 INDEL 62.5 63.0 62.0 -1.0
precision BIO250 SNV 76.7 77.3 77.1 -0.2
precision BIO250 INDEL 67.3 66.9 65.3 -1.6
ksahlin commented 6 months ago

Perhaps you are right. I may have made this conclusion based on:

Those are very vague motivations. I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?

I wanted to look at differing alignments for the analysis I did above, but I didn’t find a quick way to do a diff comparison between the versions (because large sorted BAM files). Would you have time to do some sanity check of differing alignments, or better, check the alignment scores?

For the analysis I did, the BAM files for the four datasets are here /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_v0120_cmake/*.bam and /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_ksw2_cmake/*.bam

For the two simulated files, the true alignments are at /proj/snic2022-6-31/nobackup/data/HG004/*.sam {SIM150,SIM250} which enables computing score accuracy.

marcelm commented 6 months ago

To answer a couple of questions:

I guess you verified it also had intended behaviour in the bug reported.

Current strobealign reports 143=7S for the alignment in the bug report. This PR makes it report 143=1I6= while BWA-MEM reports 140=1I9=. These two have the same score of course, but the difference is that the latter is left aligned. It would be nice to also report left-aligned alignments, but let’s consider that a separate issue (it should be possible to postprocess the alignment appropriately).

Runtime looks mostly unchanged between v0.12.0 and this branch except for SIM150.

Good, let’s hope that the slowdown on SIM150 is just an outlier. Or otherwise we’ll just have to accept the tradeoff and get a speedup some other way.

I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?

Yes, here’s the full table:


Comparing Score-based accuracy

7b3b260 Merge pull request #379 from ksahlin/optimize-kslu2 85030cf If ungapped alignment is softclipped, ksw_extend soft-clipped ends

dataset 7b3b260 85030cf difference
ecoli50-150-se 99.3834 99.4162 +0.0328
drosophila-50-se 96.3365 96.3364 -0.0001
drosophila-75-se 99.3890 99.3884 -0.0006
drosophila-100-se 99.3757 99.3761 +0.0004
drosophila-150-se 99.6204 99.6253 +0.0049
drosophila-200-se 99.6517 99.6578 +0.0061
drosophila-300-se 99.6218 99.6322 +0.0104
drosophila-500-se 99.4878 99.4992 +0.0114
maize-50-se 91.4195 91.4190 -0.0005
maize-75-se 94.3089 94.3101 +0.0012
maize-100-se 95.6757 95.6729 -0.0028
maize-150-se 97.2738 97.2835 +0.0097
maize-200-se 98.0256 98.0373 +0.0117
maize-300-se 98.4677 98.4814 +0.0137
maize-500-se 98.8132 98.8325 +0.0193
CHM13-50-se 93.1704 93.1708 +0.0004
CHM13-75-se 96.8539 96.8518 -0.0021
CHM13-100-se 97.8192 97.8209 +0.0017
CHM13-150-se 98.5407 98.5445 +0.0038
CHM13-200-se 98.6710 98.6782 +0.0072
CHM13-300-se 98.5879 98.5947 +0.0068
CHM13-500-se 98.5663 98.5738 +0.0075
rye-50-se 88.6941 88.6898 -0.0043
rye-75-se 91.8481 91.8486 +0.0005
rye-100-se 94.0778 94.0791 +0.0013
rye-150-se 96.3543 96.3630 +0.0087
rye-200-se 97.3639 97.3727 +0.0088
rye-300-se 98.0025 98.0182 +0.0157
rye-500-se 98.7715 98.7956 +0.0241
ecoli50-150-pe 99.5890 99.6164 +0.0273
drosophila-50-pe 99.5883 99.5885 +0.0002
drosophila-75-pe 99.7696 99.7697 +0.0002
drosophila-100-pe 99.7669 99.7676 +0.0007
drosophila-150-pe 99.7475 99.7512 +0.0037
drosophila-200-pe 99.7119 99.7160 +0.0041
drosophila-300-pe 99.6593 99.6668 +0.0076
drosophila-500-pe 99.5292 99.5388 +0.0096
maize-50-pe 96.5466 96.5461 -0.0004
maize-75-pe 97.9274 97.9272 -0.0001
maize-100-pe 98.6033 98.6026 -0.0008
maize-150-pe 99.1774 99.1813 +0.0039
maize-200-pe 99.3377 99.3441 +0.0064
maize-300-pe 99.4756 99.4829 +0.0073
maize-500-pe 99.3684 99.3814 +0.0131
CHM13-50-pe 98.2363 98.2363 +0.0000
CHM13-75-pe 98.7770 98.7772 +0.0002
CHM13-100-pe 99.0379 99.0382 +0.0004
CHM13-150-pe 99.2155 99.2174 +0.0019
CHM13-200-pe 99.1885 99.1927 +0.0041
CHM13-300-pe 99.1850 99.1903 +0.0054
CHM13-500-pe 99.0001 99.0100 +0.0099
rye-50-pe 94.9689 94.9691 +0.0002
rye-75-pe 96.8398 96.8385 -0.0012
rye-100-pe 97.9214 97.9195 -0.0020
rye-150-pe 98.8289 98.8330 +0.0041
rye-200-pe 99.1062 99.1108 +0.0045
rye-300-pe 99.4078 99.4156 +0.0079
rye-500-pe 99.3743 99.3908 +0.0165

Average difference se: +0.0068 Average difference pe: +0.0046


I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.

Would you have time to do some sanity check of differing alignments, or better, check the alignment scores?

I don’t have so much time left before going on vacation. I will not be able to look at your files, but may have time for some other checks.

ksahlin commented 6 months ago

These two have the same score of course, but the difference is that the latter is left aligned.

I see. Not consistently left/right aligning indels could possibly explain the lower precision. Nice! I think we even saw it in a previous evaluation using WFA2, IIRC.

I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.

I had expected that we would never end up with lower score. But I realize I don't fully understand the consequences of the new approach. If we did a 'full' (semi-global) realignment before, compared to only extending the softclipped ends in this branch, maybe the sum of the scores of the aligned 'pieces' is lower than the semi-global score?

Related: When you extract the softclipped ends to extend, I see you only take the softclipped part:

const std::string query_right = query.substr(info.query_end);

perhaps there is a benefit of taking an overlapping part of the NAM, such as kor k/2? I understand the logic of the surrounding code and the CIGAR might become more complicated. But do you think this could be a reason for the change in score?