ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

Random paired-end alignment selection #367

Closed marcelm closed 7 months ago

marcelm commented 7 months ago

Continues work started in #364

I have not done any measurements, will do so later.

Closes #359

marcelm commented 7 months ago

This is how it looks:

genomes-pe

Doesn’t look perfect, but a little bit better (this PR is "randompe"). I will quantify this later.

I haven’t added the random selection to rescue_read. Maybe it would benefit as well. And I need to check whether what I already added works correctly.

marcelm commented 7 months ago

Standard deviations:

bwa             0.003596
main            0.053338
shuffle-nams    0.023453
randompe        0.015721
ksahlin commented 7 months ago

Very nice to see the progression!

Now that we are getting closer to uniform it would be interesting to see the coverage fluctuations for the 'true alignments' output by mason_simulate. I am thinking there must be some natural fluctuation that may also come from small accuracy differences. Although the fluctuations in perfect repeat regions are probably much larger than any accuracy misses. Nevertheless, it's probably easy to add these values to the plot(?).

ksahlin commented 7 months ago

I don't want to derail the discussion here, but related: Could you also do a sanity check on this dataset the scoring for placing both reads in a proper pair 'works'. That is, if you notice any reads on different E. colis it would be a good case for fixing.

(related to what we discussed with @psj1997 over email)

marcelm commented 7 months ago

Added a purple line to show values for the "truth" BAM: genomes-pe

Standard deviation is 0.003249. So BWA-MEM is very close.

marcelm commented 7 months ago

Well, what do you know ... I fixed a bug where I would stop a bit too early looking at all alignments and it looks a lot better (the fix is labeled "randompe2"). Note different colors and I excluded main to reduce clutter.

genomes-pe

Algorithm stddev
bwa 0.003596
randompe 0.015721
randompe2 0.003701
truth 0.003249
marcelm commented 7 months ago

I now changed it so that even rescue_read uses the same logic to pick a random best alignment. Named "randompe3" here:

genomes-pe

This actually results in a slighly higher stddev of 0.004084, but this is probably just an artifact of not testing enough datasets (the numbers are from mapping a single "ecoli50" dataset with 1 milion 2x150 bp read pairs).

ksahlin commented 7 months ago

Neat! Looks like you solved it over night :) Whats the effect on runtime? Probably good to check both on this dataset and on CHM and rye.

If runtimes look reasonable I am happy to merge.

marcelm commented 7 months ago

The runtime appears to be unchanged.

Here are changes in accuracy on (most) datasets (average difference PE: -0.0018):

dataset e8e5ec3 (main) 54f9fe4 (this PR) difference
ecoli50-150-pe 32.64545 32.59525 -0.0502
drosophila-50-pe 90.17505 90.1927 +0.0177
drosophila-75-pe 91.62985 91.6523 +0.0225
drosophila-100-pe 92.37945 92.3832 +0.0037
drosophila-150-pe 93.1991 93.2278 +0.0287
drosophila-200-pe 93.5143 93.49 -0.0243
drosophila-300-pe 95.37105 95.34695 -0.0241
drosophila-500-pe 95.673 95.6878 +0.0148
maize-50-pe 71.48905 71.4953 +0.0063
maize-75-pe 82.1222 82.1104 -0.0118
maize-100-pe 87.1474 87.14605 -0.0014
maize-150-pe 91.6918 91.70585 +0.0140
maize-200-pe 92.93575 92.93905 +0.0033
maize-300-pe 96.71725 96.7143 -0.0029
maize-500-pe 97.29125 97.2786 -0.0126
CHM13-50-pe 90.64065 90.657 +0.0163
CHM13-75-pe 92.51435 92.53705 +0.0227
CHM13-100-pe 93.22665 93.2251 -0.0015
CHM13-150-pe 94.1352 94.13245 -0.0028
CHM13-200-pe 94.4225 94.43375 +0.0113
CHM13-300-pe 95.63475 95.6424 +0.0076
CHM13-500-pe 95.97645 95.959 -0.0174
rye-50-pe 69.185 69.1466 -0.0384
rye-75-pe 80.59005 80.56485 -0.0252
ksahlin commented 7 months ago

Awesome! Approved to merge.