Use ksw2 to align soft-clipped ends from ungapped alignment

marcelm commented 6 months ago

This should fix some unexpected alignments as reported in #357.

When extending a seed, we currently compute an ungapped alignment allowing softclipping and report this unchanged if the hamming distance (counting each soft-clipped base as a mismatch) is low enough.

This is problematic because there might exist a gapped alignment of the soft-clipped ends that results in an overall higher score.

This PR changes the behavior so that after computing the ungapped alignment, ksw2 is used to align each end that has been soft clipped.

The first commits that update the ksw2 source and add an Aligner::ksw_extend function were taken from the abandoned wfa2 branch.
The ksw2 functions have only very little documentation, so some experimentation was needed to understand how to correctly call them. I needed to make some fixes on top of the code in the wfa2 branch. For example, some example code for ksw2 uses a score of 0 for comparison of wildcard characters to regular nucleotides. This may make sense, but for compatibility with SSW, I changed this to be the same as a normal mismatch.
The actual behavior change is in 85030cfa81ffebfd21dadc339b4dff72176d91c0.

I changed two reads in the phiX test dataset so that they would be soft-clipped with the old code but now result in a non-soft-clipped alignment with a single deletion, similar to the problematic read in #357.

Accuracy changes minimally (+0.006); mapping rate is unaffected. Runtime is hard to measure, maybe it’s 1-2% slower.

The addition of ksw_extend also enables addressing #377, but this is left for a separate PR.

Closes #357

CC @tprodanov

ksahlin commented 6 months ago

Excellent! Since it was a while ago I did a benchmark, I decided to run f4a8dd2 (named _kws2) against v0.12.0 (commit 6fd4c5d).

Minimal difference in runtime.
The improved accuracy on 50nt reads likely comes from the parameter optimisation in a previous PR.

accuracy_plot.pdf time_plot.pdf memory_plot.pdf

I have also started a benchmark on biological data with SNP/indel calling, which may be more relevant. Will report back here.

ksahlin commented 6 months ago

Results for the variant analysis below.

BIO datasets SNV: Last versions (v0.12.0 and ksw2) are approaching BWA-MEM's values (increased recall and lower precision over previous strobealign versions)
BIO datasets INDEL: The ksw2 version has higher recall but lower precision compared to all previous versions. Importantly, it approaches or supersedes the values of BWA-MEM.
The results for the simulated datasets (SIM) look quite similar or slightly worse compared to previous versions, but again, moving towards BWA-MEM values.

I have a hard time forming a consensus, but overall I think it looks good. I guess you verified it also had intended behaviour in the bug reported.

I approve a merge.

Screenshot 2023-12-18 at 16 45 55

Screenshot 2023-12-18 at 16 42 35

ksahlin commented 6 months ago

Adding the time and memory usage here as well b/t v0.7.1, v0.12.0 and this branch for reference. Runtime looks mostly unchanged between v0.12.0 and this branch except for SIM150.

It's one run per experiment though, but the node I run on is cleared from any other interfering processes.

Screenshot 2023-12-18 at 21 22 27

marcelm commented 6 months ago

I had to type this off from the screenshot, so I may have made a mistake, but are you sure this PR is better than 0.12.0?

what	BWA-MEM	strobealign 0.12	extend-seed	diff
recall SIM150 SNV	97.9	97.6	97.6	0
recall SIM150 INDEL	62.2	67.3	67.0	-0.3
recall SIM250 SNV	98.9	98.6	98.6	0
recall SIM250 INDEL	68.4	68.5	68.5	0
recall BIO150 SNV	98.9	98.4	98.4	0
recall BIO150 INDEL	90.1	89.7	90.2	+0.5
recall BIO250 SNV	97.7	97.5	97.5	0
recall BIO250 INDEL	84.1	83.0	84.1	+1.1
precision SIM150 SNV	99.6	99.4	99.4	0
precision SIM150 INDEL	72.6	72.7	72.2	-0.5
precision SIM250 SNV	99.5	99.3	99.3	0
precision SIM250 INDEL	72.8	73.0	72.8	-0.2
precision BIO150 SNV	80.5	82.5	82.5	0
precision BIO150 INDEL	62.5	63.0	62.0	-1.0
precision BIO250 SNV	76.7	77.3	77.1	-0.2
precision BIO250 INDEL	67.3	66.9	65.3	-1.6

ksahlin commented 6 months ago

Perhaps you are right. I may have made this conclusion based on:

I don't remember the exact reason, but I don’t trust the INDEL analysis a 100% based on what we discovered in an earlier analysis (something with how overlaps and TP are computed).
We move towards BWA-MEMs higher recall lower precision values.

Those are very vague motivations. I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?

I wanted to look at differing alignments for the analysis I did above, but I didn’t find a quick way to do a diff comparison between the versions (because large sorted BAM files). Would you have time to do some sanity check of differing alignments, or better, check the alignment scores?

For the analysis I did, the BAM files for the four datasets are here /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_v0120_cmake/*.bam and /proj/snic2022-6-31/nobackup/strobealign_eval/alignments_PE/strobealign_ksw2_cmake/*.bam

For the two simulated files, the true alignments are at /proj/snic2022-6-31/nobackup/data/HG004/*.sam {SIM150,SIM250} which enables computing score accuracy.

marcelm commented 6 months ago

To answer a couple of questions:

I guess you verified it also had intended behaviour in the bug reported.

Current strobealign reports 143=7S for the alignment in the bug report. This PR makes it report 143=1I6= while BWA-MEM reports 140=1I9=. These two have the same score of course, but the difference is that the latter is left aligned. It would be nice to also report left-aligned alignments, but let’s consider that a separate issue (it should be possible to postprocess the alignment appropriately).

Runtime looks mostly unchanged between v0.12.0 and this branch except for SIM150.

Good, let’s hope that the slowdown on SIM150 is just an outlier. Or otherwise we’ll just have to accept the tradeoff and get a speedup some other way.

I guess the best way is to check that the alignment score is always better with the new version. Did you perform such a check on the simulated data?

Yes, here’s the full table:

Comparing Score-based accuracy

7b3b260 Merge pull request #379 from ksahlin/optimize-kslu2 85030cf If ungapped alignment is softclipped, ksw_extend soft-clipped ends

dataset	7b3b260	85030cf	difference
ecoli50-150-se	99.3834	99.4162	+0.0328
drosophila-50-se	96.3365	96.3364	-0.0001
drosophila-75-se	99.3890	99.3884	-0.0006
drosophila-100-se	99.3757	99.3761	+0.0004
drosophila-150-se	99.6204	99.6253	+0.0049
drosophila-200-se	99.6517	99.6578	+0.0061
drosophila-300-se	99.6218	99.6322	+0.0104
drosophila-500-se	99.4878	99.4992	+0.0114
maize-50-se	91.4195	91.4190	-0.0005
maize-75-se	94.3089	94.3101	+0.0012
maize-100-se	95.6757	95.6729	-0.0028
maize-150-se	97.2738	97.2835	+0.0097
maize-200-se	98.0256	98.0373	+0.0117
maize-300-se	98.4677	98.4814	+0.0137
maize-500-se	98.8132	98.8325	+0.0193
CHM13-50-se	93.1704	93.1708	+0.0004
CHM13-75-se	96.8539	96.8518	-0.0021
CHM13-100-se	97.8192	97.8209	+0.0017
CHM13-150-se	98.5407	98.5445	+0.0038
CHM13-200-se	98.6710	98.6782	+0.0072
CHM13-300-se	98.5879	98.5947	+0.0068
CHM13-500-se	98.5663	98.5738	+0.0075
rye-50-se	88.6941	88.6898	-0.0043
rye-75-se	91.8481	91.8486	+0.0005
rye-100-se	94.0778	94.0791	+0.0013
rye-150-se	96.3543	96.3630	+0.0087
rye-200-se	97.3639	97.3727	+0.0088
rye-300-se	98.0025	98.0182	+0.0157
rye-500-se	98.7715	98.7956	+0.0241
ecoli50-150-pe	99.5890	99.6164	+0.0273
drosophila-50-pe	99.5883	99.5885	+0.0002
drosophila-75-pe	99.7696	99.7697	+0.0002
drosophila-100-pe	99.7669	99.7676	+0.0007
drosophila-150-pe	99.7475	99.7512	+0.0037
drosophila-200-pe	99.7119	99.7160	+0.0041
drosophila-300-pe	99.6593	99.6668	+0.0076
drosophila-500-pe	99.5292	99.5388	+0.0096
maize-50-pe	96.5466	96.5461	-0.0004
maize-75-pe	97.9274	97.9272	-0.0001
maize-100-pe	98.6033	98.6026	-0.0008
maize-150-pe	99.1774	99.1813	+0.0039
maize-200-pe	99.3377	99.3441	+0.0064
maize-300-pe	99.4756	99.4829	+0.0073
maize-500-pe	99.3684	99.3814	+0.0131
CHM13-50-pe	98.2363	98.2363	+0.0000
CHM13-75-pe	98.7770	98.7772	+0.0002
CHM13-100-pe	99.0379	99.0382	+0.0004
CHM13-150-pe	99.2155	99.2174	+0.0019
CHM13-200-pe	99.1885	99.1927	+0.0041
CHM13-300-pe	99.1850	99.1903	+0.0054
CHM13-500-pe	99.0001	99.0100	+0.0099
rye-50-pe	94.9689	94.9691	+0.0002
rye-75-pe	96.8398	96.8385	-0.0012
rye-100-pe	97.9214	97.9195	-0.0020
rye-150-pe	98.8289	98.8330	+0.0041
rye-200-pe	99.1062	99.1108	+0.0045
rye-300-pe	99.4078	99.4156	+0.0079
rye-500-pe	99.3743	99.3908	+0.0165

Average difference se: +0.0068 Average difference pe: +0.0046

I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.

Would you have time to do some sanity check of differing alignments, or better, check the alignment scores?

I don’t have so much time left before going on vacation. I will not be able to look at your files, but may have time for some other checks.

ksahlin commented 6 months ago

These two have the same score of course, but the difference is that the latter is left aligned.

I see. Not consistently left/right aligning indels could possibly explain the lower precision. Nice! I think we even saw it in a previous evaluation using WFA2, IIRC.

I had convinced myself that it is ok that the score gets worse in some cases, but now I’m not so sure anymore. I’ll have to look into this.

I had expected that we would never end up with lower score. But I realize I don't fully understand the consequences of the new approach. If we did a 'full' (semi-global) realignment before, compared to only extending the softclipped ends in this branch, maybe the sum of the scores of the aligned 'pieces' is lower than the semi-global score?

Related: When you extract the softclipped ends to extend, I see you only take the softclipped part:

const std::string query_right = query.substr(info.query_end);

perhaps there is a benefit of taking an overlapping part of the NAM, such as kor k/2? I understand the logic of the surrounding code and the CIGAR might become more complicated. But do you think this could be a reason for the change in score?

ksahlin / strobealign

Use ksw2 to align soft-clipped ends from ungapped alignment #382

Comparing Score-based accuracy