Random paired-end alignment selection

marcelm commented 7 months ago

Continues work started in #364

I have not done any measurements, will do so later.

Closes #359

marcelm commented 7 months ago

This is how it looks:

genomes-pe

Doesn’t look perfect, but a little bit better (this PR is "randompe"). I will quantify this later.

I haven’t added the random selection to rescue_read. Maybe it would benefit as well. And I need to check whether what I already added works correctly.

marcelm commented 7 months ago

Standard deviations:

bwa             0.003596
main            0.053338
shuffle-nams    0.023453
randompe        0.015721

ksahlin commented 7 months ago

Very nice to see the progression!

Now that we are getting closer to uniform it would be interesting to see the coverage fluctuations for the 'true alignments' output by mason_simulate. I am thinking there must be some natural fluctuation that may also come from small accuracy differences. Although the fluctuations in perfect repeat regions are probably much larger than any accuracy misses. Nevertheless, it's probably easy to add these values to the plot(?).

ksahlin commented 7 months ago

I don't want to derail the discussion here, but related: Could you also do a sanity check on this dataset the scoring for placing both reads in a proper pair 'works'. That is, if you notice any reads on different E. colis it would be a good case for fixing.

(related to what we discussed with @psj1997 over email)

marcelm commented 7 months ago

Added a purple line to show values for the "truth" BAM: genomes-pe

Standard deviation is 0.003249. So BWA-MEM is very close.

marcelm commented 7 months ago

Well, what do you know ... I fixed a bug where I would stop a bit too early looking at all alignments and it looks a lot better (the fix is labeled "randompe2"). Note different colors and I excluded main to reduce clutter.

genomes-pe

Algorithm	stddev
bwa	0.003596
randompe	0.015721
randompe2	0.003701
truth	0.003249

marcelm commented 7 months ago

I now changed it so that even rescue_read uses the same logic to pick a random best alignment. Named "randompe3" here:

genomes-pe

This actually results in a slighly higher stddev of 0.004084, but this is probably just an artifact of not testing enough datasets (the numbers are from mapping a single "ecoli50" dataset with 1 milion 2x150 bp read pairs).

ksahlin commented 7 months ago

Neat! Looks like you solved it over night :) Whats the effect on runtime? Probably good to check both on this dataset and on CHM and rye.

If runtimes look reasonable I am happy to merge.

marcelm commented 7 months ago

The runtime appears to be unchanged.

Here are changes in accuracy on (most) datasets (average difference PE: -0.0018):

dataset	e8e5ec3 (main)	54f9fe4 (this PR)	difference
ecoli50-150-pe	32.64545	32.59525	-0.0502
drosophila-50-pe	90.17505	90.1927	+0.0177
drosophila-75-pe	91.62985	91.6523	+0.0225
drosophila-100-pe	92.37945	92.3832	+0.0037
drosophila-150-pe	93.1991	93.2278	+0.0287
drosophila-200-pe	93.5143	93.49	-0.0243
drosophila-300-pe	95.37105	95.34695	-0.0241
drosophila-500-pe	95.673	95.6878	+0.0148
maize-50-pe	71.48905	71.4953	+0.0063
maize-75-pe	82.1222	82.1104	-0.0118
maize-100-pe	87.1474	87.14605	-0.0014
maize-150-pe	91.6918	91.70585	+0.0140
maize-200-pe	92.93575	92.93905	+0.0033
maize-300-pe	96.71725	96.7143	-0.0029
maize-500-pe	97.29125	97.2786	-0.0126
CHM13-50-pe	90.64065	90.657	+0.0163
CHM13-75-pe	92.51435	92.53705	+0.0227
CHM13-100-pe	93.22665	93.2251	-0.0015
CHM13-150-pe	94.1352	94.13245	-0.0028
CHM13-200-pe	94.4225	94.43375	+0.0113
CHM13-300-pe	95.63475	95.6424	+0.0076
CHM13-500-pe	95.97645	95.959	-0.0174
rye-50-pe	69.185	69.1466	-0.0384
rye-75-pe	80.59005	80.56485	-0.0252

ksahlin commented 7 months ago

Awesome! Approved to merge.

ksahlin / strobealign

Random paired-end alignment selection #367