Closed: marcelm closed this 9 months ago
Really great that you are performing this analysis!
I am happy to change default settings for 50, 75, 100.
For longer read lengths, I think we need to consider eventual speed tradeoffs. While e.g. a +0.020
improvement is nice, it would perhaps not be worth it if runtime increases by more than 10%.
As for the points 1-4:
All the runs have finished and I have updated the table to include results for read lengths 200 and 500. Even though the improvements for the larger read lengths are small (and we may not want to make these changes), I’m glad we have the numbers because we now know that the current parameters are already quite good.
Regarding canonical read length 125: I think that if you are fine with changing the parameters for 100, we should also change them for 125. If I remember correctly, we set these to be an interpolation between 100 and 150, and we should just re-do the interpolation.
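To make the interpolation step concrete, here is a minimal sketch. Both endpoint tuples below are placeholders for illustration, not strobealign’s actual settings:

```python
def interpolate_params(p100, p150, length=125):
    """Linearly interpolate each (k, s, l, u) component between the
    read-length-100 and read-length-150 settings, rounding to int."""
    t = (length - 100) / (150 - 100)
    return tuple(round(a + t * (b - a)) for a, b in zip(p100, p150))

# Placeholder endpoint settings, for illustration only:
print(interpolate_params((20, 16, -2, -1), (20, 16, 0, 7)))  # (20, 16, -1, 3)
```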
Here’s how the suggested changes impact rescue counts and similar numbers for CHM13:
What | Current (20, 16, 4, 13) | Suggested (22, 18, 4, 14)
---|---|---
Total mapping sites tried | 1804922 | 1788317
Total calls to ssw | 194030 | 184665
Inconsistent NAM ends | 3407 | 2966
Tried NAM rescue | 87835 | 80766
The picture is the same for the other three datasets. Runtime doesn’t change measurably.
Exactly the same thing happens as for read length 200: mapping sites tried, SSW calls, and NAM rescues all go down; only the number of inconsistent NAM ends goes up.
CHM13 in detail:
What | Current (22, 18, 2, 12) | Suggested (22, 18, 3, 13)
---|---|---
Total mapping sites tried | 1784220 | 1778315
Total calls to ssw | 218392 | 212496
Inconsistent NAM ends | 1876 | 1898
Tried NAM rescue | 80865 | 80351
Again, no measurable runtime difference.
Same thing for all four datasets.
Very good. I agree with most of your updates but just wanted to ask some specific things before merging.

Having (some) smaller seeds will help supplementary alignments (split alignments), thus keeping/reducing `l` is of some interest. If you have the measurements easily available/stored:

- Read length 100: how does (20, 16, -2, -1) compare to baseline and to current best? If comparable, I would opt for (20, 16, -2, -1) to keep the lower bound.
- Read length 125: perhaps interpolate to (20, 16, -1, 3)?
- Read length 400: perhaps worth keeping `l`, so (23, 17, 2, 11).
- I didn't see an updated table with measurements for 200 and 500 included.

Edit: I found 200 and 500 (canonical classes).
> Having (some) smaller seeds will help supplementary alignments (split alignments), thus keeping/reducing `l` is of some interest. If you have the measurements easily available/stored:
>
> - Read length 125: perhaps interpolate to (20, 16, -1, 3)?

Sure, updated.

> - Read length 400: perhaps worth keeping `l`, so (23, 17, 2, 11).
For this parameter combination, PE drosophila accuracy was lower than baseline (dropped from 95.6773 to 95.66765), which is why it was excluded from further measurements. (That is, I have no numbers for CHM13, maize, rye.)
These are the only options that do not reduce accuracy for at least one of the datasets:
k | s | l | u | PE | SE | PE diff | SE diff
---|---|---|---|---|---|---|---
23 | 17 | 5 | 11 | 94.670225 | 95.9844375 | +0.0397 | +0.015462
23 | 17 | 5 | 12 | 94.679975 | 95.98045 | +0.0494 | +0.011475
23 | 17 | 4 | 12 | 94.666525 | 95.9749 | +0.0360 | +0.005925
23 | 17 | 2 | 12 | 94.630525 | 95.968975 | 0.0 | 0.0
> - Read length 100: how does (20, 16, -2, -1) compare to baseline and to current best? If comparable, I would opt for (20, 16, -2, -1) to keep the lower bound.
This parameter combination was excluded because maize SE accuracy dropped from 70.5148 to 70.438.
> - I didn't see an updated table with measurements for 200 and 500 included.
I meant that I measured actual read lengths 200 and 500, but put them into the rows for canonical read length 250 and 400.
Judging from your comments, I see that you are a bit worried about overfitting and I totally agree. I think the updated settings for read lengths 50 and 75 should go in, but we can also just leave the rest alone. The optimization criterion is not accuracy, but something else that is not reflected in our test data. Until it is, we need to rely on your intuition.
Yes I am a bit worried about overfitting, but most parameter changes in the table would probably not affect supplementary alignments, some would even be beneficial (50-125).
Based on your comments, and until we look into supplementary alignments, I suggest implementing the parameter changes you propose in your summary table for all canonical read-length classes except 400. The proposed change for the 400 class is the only parameter change with a larger increase in `l`.
Do you agree?
Agreed! Will update the PR appropriately and merge it later.
With this merged, I think it’s time for a release. Do you want to run your evaluations on commit 9b95973fbe1766f89fdbf5a3aacdb4435304d987?
Great, will do. I'll get back to you about timing; hopefully tonight or tomorrow.
I pulled and compiled https://github.com/ksahlin/strobealign/commit/9b95973fbe1766f89fdbf5a3aacdb4435304d987 and ran it on the larger simulated datasets (called v0.12.0 in plots). Plots attached. I am a bit puzzled by the results. In summary:
Given this, it seems like we cannot integrate any of the parameters until we know what is going on. The accuracy was computed with my original scripts (not jaccard).
I double- and triple-checked that I ran the right commit:

```
$ strobealign-9b95973
strobealign reference [reads1] [reads2] {OPTIONS}
strobelign 0.11.0-103-g9b95973
```

(Attachments: accuracy_plot.pdf, accuracy_plot_zoomed.pdf, time_plot.pdf, memory_plot.pdf, percentage_aligned_plot.pdf)
(All datasets in this analysis are PE reads)
What seems particularly strange is that the parameters for 150 bp reads did not change in this commit, from what I understand. I therefore wonder whether the decrease could be caused by some other commit since v0.11.0? I am not sure which baseline commit you are benchmarking against.
Also, just mentioning that I simulate reads with mason using the following variant frequencies w.r.t. the references: `mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50`. How did you simulate your reads? This would not explain the discrepancy for the 150 bp dataset, though.
Yes, I noticed as well that some of the changes cannot be caused by the new parameters. I’m looking at old commits to find out what is going on.
I did not simulate reads myself but used the first 1 M reads of your datasets.
I’m looking at the memory consumption at the moment. What are the absolute values for rye, read length 50? For me, memory usage for 0.11.0 is 35.3 GB and 36.1 GB for main. So there’s a slight increase, but not that much (I’ll try to find out which commit caused this). Is it possible that the plot doesn’t show results for 0.11.0, but for some commit before the change to 64 bit randstrobe indices? That did increase memory usage from 33 GB to 35.3 GB.
Yes, you are right. In the plot labelled as `v0.11.0_cmake` I was running `strobealign-e0764b6` (https://github.com/ksahlin/strobealign/commit/e0764b6a24f05f9b6d0853286af93080a3b0452b). It says in the commit that we had already changed to 64 bits at that point.
Below I have pasted the exact values:

```
tool,dataset,read_length,time,memory
strobealign_v071,rye,50,646.7549999999999,50.286916
strobealign_v071_map,rye,50,443.653,50.286916
strobealign_v080_conda,rye,50,696.6300000000001,48.79224
strobealign_v080_cmake,rye,50,862.1099999999999,48.792032
strobealign_v080_cmake_map,rye,50,697.26,48.791976
strobealign_v090_cmake,rye,50,824.16,48.792368
strobealign_v0110_cmake,rye,50,734.5899999999999,33.727584
strobealign_v0110_cmake_map,rye,50,564.01,33.727556
strobealign_v0120_cmake,rye,50,780.1,36.969792
strobealign_v0120_cmake_map,rye,50,621.8399999999999,36.969716
```
These are the peak RSS values reported during alignment with `/usr/bin/time -v`.

> That did increase memory usage from 33 GB to 35.3 GB.

Otherwise my measurements seem to agree well with what you see.
I have also posted CHM13 50 nt below:

```
strobealign_v071,CHM13,50,205.613,31.445908
strobealign_v071_map,CHM13,50,143.928,31.086756
strobealign_v080_conda,CHM13,50,209.55,23.96776
strobealign_v080_cmake,CHM13,50,224.95000000000002,23.967372
strobealign_v080_cmake_map,CHM13,50,185.74,23.967352
strobealign_v090_cmake,CHM13,50,246.75000000000003,23.967708
strobealign_v0110_cmake,CHM13,50,219.26000000000002,14.550108
strobealign_v0110_cmake_map,CHM13,50,170.55999999999997,14.393092
strobealign_v0120_cmake,CHM13,50,243.62,15.284884
strobealign_v0120_cmake_map,CHM13,50,202.77,15.241692
```
Did you change the number of threads?
Nope, I am running `/usr/bin/time -v strobealign-9b95973 -t 16 -o {tmp_sam} {input.ref} {tmp_fq_L} {tmp_fq_R}` as with all previous commits.
Details are here: https://github.com/ksahlin/alignment_evaluation/blob/master/evaluation_genomes/Snakefile
I found multiple commits that change memory usage for rye.
Memory usage (GiB) | Commit | What
---|---|---
30.2 | (baseline) | New index structure, before switching to 64-bit bucket indices
32.1 | bdf4012 | Switch to 64-bit bucket indices
33.3 | 6c67bfc | Use xxh64 to hash the s-mer
35.2 | de58725 | Compute strobemers in parallel
Commit 6c67bfc (hashing s-mers) needs more RAM because more strobemers are produced. Before:

```
Estimated total memory usage: 33.67 GB
Total strobemers: 1380324847
```

After 6c67bfc:

```
Total strobemers: 1454354777
Estimated total memory usage: 34.85 GB
```
(I was confused by the estimate being too high, but I now realize that the estimate uses 1 GB = $10^9$ bytes while the measurements use 1 GiB = $2^{30}$ bytes.)
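The two conventions differ by about 7%, which is enough to account for the confusion; a quick sanity check:

```python
GB = 10**9    # decimal gigabyte, used by the "Estimated total memory usage" line
GiB = 2**30   # binary gibibyte, as reported by /usr/bin/time measurements

estimate_bytes = 34.85 * GB
print(f"{estimate_bytes / GiB:.2f} GiB")  # 32.46 GiB
```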
I’ll need to investigate what happens in de58725.
Note for later: The decrease in accuracy is due to commit https://github.com/ksahlin/strobealign/commit/91f22234baf192a3167354a7f2bbfefb49cb2162, likely due to nam_ids being re-used (and colliding). Commit https://github.com/ksahlin/strobealign/commit/4ac1e1b8253d2c756ba59714301d97368635f401 bumps accuracy back to the previous level, need to turn this into a proper fix.
I see. That’s good news. I was worried it was gonna be https://github.com/ksahlin/strobealign/commit/6c67bfca9e4898064ac7ad907b31bcfee2abb85a, which would be discouraging since, theoretically, hashing s-mers should be the right thing to do.
As you saw, I submitted #350 to fix the accuracy issue.
The only remaining thing is the decreased mapping rate. This is caused by commit 0239650, which is the commit that applies the tuned parameters.
In my opinion, the decrease in mapping rate is a good thing since the reads that are now no longer mapped were incorrect anyway.
The accuracy is currently the percentage of all reads that were mapped correctly. With this, the aim should be to get the mapping rate close to the accuracy.
If we changed it such that accuracy is the percentage of mapped reads that were mapped correctly, which would also make sense, then accuracy would go up even more with the tuned parameters.
I guess what I’m saying is: We should be glad precision goes up, but the plots don’t make that clear.
(Hm, thinking about whether the precision/recall terminology is appropriate here, it appears to me we don’t have any true negatives, but maybe that’s not necessary.)
I also wonder whether it makes sense to somehow evaluate multimappers differently when computing accuracy. A read that maps equally well to two or more places should still be counted as somewhat correct when it’s assigned to one of those positions, even if that wasn’t where it came from.
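To make the distinction between the two metrics above concrete, here is a minimal sketch (the read counts are made up for illustration):

```python
def accuracy(correct, total_reads):
    """Percentage of ALL reads mapped to the correct location."""
    return 100 * correct / total_reads

def precision(correct, mapped):
    """Percentage of MAPPED reads mapped to the correct location."""
    return 100 * correct / mapped

# Made-up counts for illustration:
total_reads, mapped, correct = 1_000_000, 980_000, 950_000
print(f"accuracy:  {accuracy(correct, total_reads):.2f}%")   # 95.00%
print(f"precision: {precision(correct, mapped):.2f}%")       # 96.94%
```

With the second definition, dropping incorrectly mapped reads (lowering the mapping rate) raises the reported number instead of leaving it unchanged.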
Interesting phenomenon. I had not anticipated this. While high precision is good, here is my take:
While I generally agree that accuracy (as measured by tot_correct_mapped/tot_reads) is the metric to primarily optimise for, there are situations (perfect repeats) where more mapped reads are good, such as for coverage-based applications. While I don't have data to support this, it is likely that most of the loss in mapped reads was in repetitive regions (including perfect repeats).
It could be that much smaller seeds lead to many hits going beyond the hard threshold (1000?) so that the read simply becomes unmapped (strobealign gives up). While it is okay to take some loss in percentage mapped, it is worth monitoring. The drop we saw is probably acceptable (borderline) given the increase in accuracy.
I am not sure why the decline is so big even for 100nt reads. I will try the next commit (after your optimization) and see the final damage in percentage mapped at that point.
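A rough sketch of the suspected mechanism (the cap of 1000 and the data layout are guesses for illustration, not strobealign's actual implementation):

```python
def map_read(candidate_sites, max_tries=1000):
    """candidate_sites: list of (position, score) pairs for one read.
    If a repetitive read produces more candidates than the cap,
    give up and report it as unmapped (return None)."""
    if not candidate_sites or len(candidate_sites) > max_tries:
        return None  # unmapped
    return max(candidate_sites, key=lambda ps: ps[1])  # best-scoring site

# A read from a perfect repeat with 5000 equally good candidate sites:
print(map_read([(i * 100, 50) for i in range(5000)]))  # None -> unmapped
# A read with a clear best site:
print(map_read([(10, 30), (250, 55)]))  # (250, 55)
```

Smaller seeds occur more often in the genome, so more reads cross such a cap and fall into the "give up" branch.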
So the accuracy bug does not affect the percentage mapped reads? (I can probably answer this myself in my next evaluation pass after your optimization)
> So the accuracy bug does not affect the percentage mapped reads?
The percentage of mapped reads was not affected by the accuracy bug (NAM id collisions). It changed only because of the updated parameter settings.
I took the liberty of running commit 1587954 (current master). New plots attached. The memory bug seems fixed, but note that usage is still slightly higher on rye compared to commit https://github.com/ksahlin/strobealign/commit/e0764b6a24f05f9b6d0853286af93080a3b0452b. Also, why don't I get the same absolute numbers as you for memory usage? Could it be because I'm running 16 threads?
Accuracy is better for 50 nt PE reads, slightly better for 75, and nearly unchanged for the remaining read lengths (a tad worse for 500 nt). Quite a hit in the number of unaligned reads as well, as we discussed in https://github.com/ksahlin/strobealign/issues/354. Worth investigating further.
(Attachments: accuracy_plot.pdf, accuracy_plot_zoomed.pdf, time_plot.pdf, percentage_aligned_plot.pdf, memory_plot.pdf)
This is the result of an exhaustive search for parameter combinations that improve accuracy over the currently used ones. (In total, strobealign ran over 7000 times.)
Due to a typo, I need to re-do some measurements for length 200, and length 500 is not finished yet. I will update the PR once they are done.

These are the changes I suggest so far. The "improvement" columns show how much average accuracy increases. As in #339, I have ranked the parameter combinations by PE accuracy plus 1/4 SE accuracy.
Canonical length | Current (k, s, l, u) | Suggested (k, s, l, u) | SE improvement | PE improvement | Status
---|---|---|---|---|---
300 | (22, 18, 2, 12) | (23, 19, 6, 12) | +0.042 | +0.020 | To Do
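The ranking criterion can be sketched as follows, using the (PE, SE) accuracy values reported for the candidate parameter sets earlier in this thread:

```python
def score(pe_accuracy, se_accuracy):
    """Combined ranking criterion: PE accuracy plus 1/4 SE accuracy."""
    return pe_accuracy + se_accuracy / 4

# (k, s, l, u) -> (PE accuracy, SE accuracy), values from the
# candidate table earlier in this thread:
candidates = {
    (23, 17, 5, 11): (94.670225, 95.9844375),
    (23, 17, 5, 12): (94.679975, 95.98045),
    (23, 17, 4, 12): (94.666525, 95.9749),
}
best = max(candidates, key=lambda params: score(*candidates[params]))
print(best)  # (23, 17, 5, 12)
```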
I have the accuracy measurements in an SQLite database file and can relatively easily produce tables showing more detailed statistics if needed.
Edit: Added numbers for read lengths 200 and 500 (canonical 250 and 400).