Closed marcelm closed 7 months ago
This is how it looks:
Doesn’t look perfect, but a little bit better (this PR is "randompe"). I will quantify this later.
I haven’t added the random selection to `rescue_read`. Maybe it would benefit as well. And I need to check whether what I already added works correctly.
Standard deviations:
bwa 0.003596
main 0.053338
shuffle-nams 0.023453
randompe 0.015721
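The thread does not say exactly how these standard deviations were computed. As a rough sketch, assuming they measure how evenly reads are spread across the reference genomes, one could normalize per-reference read counts to fractions and take their population standard deviation. The function name `coverage_stddev` and the choice of normalization are assumptions made for illustration, not taken from the PR:

```python
from statistics import pstdev


def coverage_stddev(read_counts):
    """Population standard deviation of per-reference read fractions.

    read_counts: number of reads assigned to each reference (hypothetical input).
    A perfectly uniform distribution gives 0.0.
    """
    total = sum(read_counts)
    if total == 0:
        raise ValueError("no mapped reads")
    # Normalize so the result does not depend on the total number of reads.
    fractions = [count / total for count in read_counts]
    return pstdev(fractions)
```

With this definition, a lower value means reads are distributed more uniformly across the references, which is the direction the numbers above move in.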
Very nice to see the progression!
Now that we are getting closer to uniform it would be interesting to see the coverage fluctuations for the 'true alignments' output by `mason_simulate`. I am thinking there must be some natural fluctuation that may also come from small accuracy differences. Although the fluctuations in perfect repeat regions are probably much larger than any accuracy misses. Nevertheless, it's probably easy to add these values to the plot(?).
I don't want to derail the discussion here, but related: Could you also do a sanity check on this dataset that the scoring for placing both reads in a proper pair 'works'? That is, if you notice any read pairs whose mates end up on different E. coli genomes, it would be a good case for fixing.
(related to what we discussed with @psj1997 over email)
Added a purple line to show values for the "truth" BAM:
Standard deviation is 0.003249. So BWA-MEM is very close.
Well, what do you know ... I fixed a bug where I would stop a bit too early looking at all alignments and it looks a lot better (the fix is labeled "randompe2"). Note the different colors; I also excluded main to reduce clutter.
Algorithm | stddev |
---|---|
bwa | 0.003596 |
randompe | 0.015721 |
randompe2 | 0.003701 |
truth | 0.003249 |
I now changed it so that even `rescue_read` uses the same logic to pick a random best alignment. Named "randompe3" here:
This actually results in a slightly higher stddev of 0.004084, but this is probably just an artifact of not testing enough datasets (the numbers are from mapping a single "ecoli50" dataset with 1 million 2x150 bp read pairs).
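For readers unfamiliar with the idea, here is a minimal sketch of picking a random best alignment among equally good candidates. It is not the code from this PR: the function `pick_random_best`, the `score` callback, and the use of reservoir sampling are assumptions made for illustration. The loop visits every candidate, because stopping early would exclude later ties and bias the choice.

```python
import random


def pick_random_best(candidates, score, rng=random):
    """Return one candidate with the maximum score, chosen uniformly among ties.

    candidates: iterable of alignment-like objects (hypothetical).
    score: function mapping a candidate to a comparable score.
    rng: object with a randrange() method, e.g. random.Random(seed).
    Returns None if there are no candidates.
    """
    best = None
    best_score = None
    n_ties = 0
    for candidate in candidates:
        s = score(candidate)
        if best_score is None or s > best_score:
            # Strictly better: start a new group of ties.
            best, best_score, n_ties = candidate, s, 1
        elif s == best_score:
            # Reservoir sampling: the k-th tie replaces the pick with probability 1/k,
            # which leaves every tie equally likely once the loop has finished.
            n_ties += 1
            if rng.randrange(n_ties) == 0:
                best = candidate
    return best
```

Passing a seeded `random.Random` instance as `rng` keeps the choice reproducible between runs.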
Neat! Looks like you solved it overnight :) What's the effect on runtime? Probably good to check both on this dataset and on CHM and rye.
If runtimes look reasonable I am happy to merge.
The runtime appears to be unchanged.
Here are changes in accuracy on (most) datasets (average difference PE: -0.0018):
dataset | e8e5ec3 (main) | 54f9fe4 (this PR) | difference |
---|---|---|---|
ecoli50-150-pe | 32.64545 | 32.59525 | -0.0502 |
drosophila-50-pe | 90.17505 | 90.1927 | +0.0177 |
drosophila-75-pe | 91.62985 | 91.6523 | +0.0225 |
drosophila-100-pe | 92.37945 | 92.3832 | +0.0037 |
drosophila-150-pe | 93.1991 | 93.2278 | +0.0287 |
drosophila-200-pe | 93.5143 | 93.49 | -0.0243 |
drosophila-300-pe | 95.37105 | 95.34695 | -0.0241 |
drosophila-500-pe | 95.673 | 95.6878 | +0.0148 |
maize-50-pe | 71.48905 | 71.4953 | +0.0063 |
maize-75-pe | 82.1222 | 82.1104 | -0.0118 |
maize-100-pe | 87.1474 | 87.14605 | -0.0014 |
maize-150-pe | 91.6918 | 91.70585 | +0.0140 |
maize-200-pe | 92.93575 | 92.93905 | +0.0033 |
maize-300-pe | 96.71725 | 96.7143 | -0.0029 |
maize-500-pe | 97.29125 | 97.2786 | -0.0126 |
CHM13-50-pe | 90.64065 | 90.657 | +0.0163 |
CHM13-75-pe | 92.51435 | 92.53705 | +0.0227 |
CHM13-100-pe | 93.22665 | 93.2251 | -0.0015 |
CHM13-150-pe | 94.1352 | 94.13245 | -0.0028 |
CHM13-200-pe | 94.4225 | 94.43375 | +0.0113 |
CHM13-300-pe | 95.63475 | 95.6424 | +0.0076 |
CHM13-500-pe | 95.97645 | 95.959 | -0.0174 |
rye-50-pe | 69.185 | 69.1466 | -0.0384 |
rye-75-pe | 80.59005 | 80.56485 | -0.0252 |
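
As a quick check of the reported average, the difference column of the rows shown above can be averaged directly. This sketch only uses the 24 rows listed in the table; the full comparison may have covered more datasets.

```python
# Differences (this PR minus main) copied from the table rows above.
differences = [
    -0.0502,  # ecoli50-150-pe
    +0.0177, +0.0225, +0.0037, +0.0287, -0.0243, -0.0241, +0.0148,  # drosophila
    +0.0063, -0.0118, -0.0014, +0.0140, +0.0033, -0.0029, -0.0126,  # maize
    +0.0163, +0.0227, -0.0015, -0.0028, +0.0113, +0.0076, -0.0174,  # CHM13
    -0.0384, -0.0252,  # rye
]

mean_difference = sum(differences) / len(differences)
print(f"{mean_difference:+.4f}")  # prints -0.0018
```

The result agrees with the stated average difference of -0.0018 for paired-end data.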
Awesome! Approved to merge.
Continues work started in #364
I have not done any measurements, will do so later.
Closes #359