ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License

Optimize parameters #345

Closed marcelm closed 9 months ago

marcelm commented 9 months ago

This is the result of an exhaustive search for parameter combinations that improve accuracy over the currently used ones. (In total, strobealign ran over 7000 times.)

Due to a typo, I need to re-do some measurements for length 200, and length 500 is not finished yet. I will update the PR once they are done.

These are the changes I suggest so far. The "improvement" columns show how much average accuracy increases. As in #339, I have ranked the parameter combinations by PE accuracy plus 1/4 SE accuracy.

| Canonical read length | Current parameters | Suggested parameters | SE improvement | PE improvement |
|---|---|---|---|---|
| 50 | (20, 16, -3, 2) | (18, 14, -2, -2) | +2.478 | +1.116 |
| 75 | (20, 16, -3, 2) | (20, 16, -2, -2) | +0.185 | +0.288 |
| 100 | (20, 16, -2, 2) | (20, 16, -1, -1) | +0.070 | +0.137 |
| 125 | (20, 16, -1, 4) | (20, 16, -1, 3) | no data, interpolated | |
| 150 | (20, 16, 1, 7) | (20, 16, 1, 7) | no change, already best | |
| 250 | (20, 16, 4, 13) | (22, 18, 4, 14) | +0.028 | +0.020 |
| 300 | (22, 18, 2, 12) | (23, 19, 6, 12) | +0.042 | +0.020 |
| 300* | (22, 18, 2, 12) | (22, 18, 3, 13) | +0.033 | +0.013 |
| 400 | (23, 17, 2, 12) | (23, 17, 5, 11) | +0.040 | +0.015 |

To Do

I have the accuracy measurements in an SQLite database file and can relatively easily produce tables with more detailed statistics if needed.
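The "PE accuracy plus 1/4 SE accuracy" ranking used throughout these queries can be reproduced with a few lines of Python against such a database. This is a hedged sketch: it builds a tiny in-memory `results` table with made-up accuracies (the column layout matches the queries in this thread; the `having count(genome) = 4` clause is dropped because the toy data has only one genome per parameter combination).

```python
import sqlite3

# Toy in-memory stand-in for the real results database (made-up numbers).
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE results (genome TEXT, accuracy REAL, paired INTEGER,"
    " readlen INTEGER, k INTEGER, s INTEGER, l INTEGER, u INTEGER)"
)
rows = [
    ("chm13", 64.0, 0, 50, 20, 16, -3, 2),   # SE, current parameters
    ("chm13", 80.0, 1, 50, 20, 16, -3, 2),   # PE, current parameters
    ("chm13", 66.5, 0, 50, 18, 14, -2, -2),  # SE, candidate
    ("chm13", 81.2, 1, 50, 18, 14, -2, -2),  # PE, candidate
]
con.executemany("INSERT INTO results VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Rank by PE accuracy plus 1/4 SE accuracy, normalized back to a 0-100 scale.
query = """
SELECT se.k, se.s, se.l, se.u, (pe.aacc + 0.25 * se.aacc) / 1.25 AS weighted
FROM (SELECT avg(accuracy) AS aacc, k, s, l, u FROM results
      WHERE paired = 0 AND readlen = 50 GROUP BY k, s, l, u) AS se
JOIN (SELECT avg(accuracy) AS aacc, k, s, l, u FROM results
      WHERE paired = 1 AND readlen = 50 GROUP BY k, s, l, u) AS pe
  ON se.k = pe.k AND se.s = pe.s AND se.l = pe.l AND se.u = pe.u
ORDER BY weighted DESC
"""
ranked = list(con.execute(query))
for k, s, l, u, weighted in ranked:
    print(f"({k}, {s}, {l}, {u}): {weighted:.5f}")
```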

Edit: Added numbers for read lengths 200 and 500 (canonical 250 and 400).

# Read length 50

Current settings: (20, 16, -3, 2). (Source says -4 instead of -3, but it was effectively -3 because w_min was forced to be >=1.)

## SE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 50 and k=20 and s=16 and l=-3 and u=2;
64.1089|20|16|-3|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 50 and s=k-4 and u=l group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 10;
66.8115|21|17|-4|-4
66.81135|20|16|-4|-4
66.7599|22|18|-4|-4
66.758225|19|15|-3|-3
66.645075|18|14|-3|-3
66.615975|23|19|-4|-4
66.586625|18|14|-2|-2
66.349325|19|15|-2|-2
66.015825|20|16|-3|-3
65.6583|21|17|-3|-3
```

## PE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 50 and k=20 and s=16 and l=-3 and u=2;
80.10295|20|16|-3|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 50 and s=k-4 and u=l group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 10;
81.219425|18|14|-2|-2
81.203175|19|15|-2|-2
81.178675|20|16|-3|-3
81.15535|22|18|-4|-4
81.1384625|23|19|-4|-4
81.1335375|21|17|-3|-3
81.102975|21|17|-4|-4
81.0823375|22|18|-3|-3
81.0651125|20|16|-4|-4
81.02335|18|14|-1|-1
```

## Weighted PE + 1/4 SE

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-64.1089,pe.aacc-80.10295 from (select avg(accuracy) as aacc,k,s,l,u from results where paired = 0 and readlen=50 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=50 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u where se.l=se.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
66.586625|18|14|-2|-2|81.219425|18|14|-2|-2|78.292865|2.47772499999999|1.11647499999999  # ***** Picked *****
66.7599|22|18|-4|-4|81.15535|22|18|-4|-4|78.27626|2.651|1.05240000000001
66.8115|21|17|-4|-4|81.102975|21|17|-4|-4|78.24468|2.70259999999999|1.00002499999999
66.615975|23|19|-4|-4|81.1384625|23|19|-4|-4|78.233965|2.507075|1.0355125
66.349325|19|15|-2|-2|81.203175|19|15|-2|-2|78.232405|2.24042499999999|1.10022499999999
66.81135|20|16|-4|-4|81.0651125|20|16|-4|-4|78.21436|2.70244999999998|0.962162499999991
66.015825|20|16|-3|-3|81.178675|20|16|-3|-3|78.146105|1.906925|1.07572499999999
66.758225|19|15|-3|-3|80.9721125|19|15|-3|-3|78.129335|2.64932499999999|0.869162499999987
65.6583|21|17|-3|-3|81.1335375|21|17|-3|-3|78.03849|1.54939999999999|1.0305875
66.645075|18|14|-3|-3|80.8445125|18|14|-3|-3|78.004625|2.536175|0.741562499999986
```

# Read length 75

## SE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 75 and k=20 and s=16 and l=-3 and u=2;
74.761775|20|16|-3|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 75 and s=k-4 and u=l group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 10;
75.0243|23|19|-3|-3
75.003925|22|18|-3|-3
74.9946|18|14|-1|-1
74.988925|19|15|-1|-1
74.988875|21|17|-3|-3
74.947225|20|16|-2|-2
74.936225|20|16|-3|-3
74.8638|19|15|-2|-2
72.05065|21|17|1|1
71.25655|19|15|2|2
```

## PE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 75 and k=20 and s=16 and l=-3 and u=2;
86.511525|20|16|-3|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 75 and s=k-4 and u=l group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 10;
86.833475|22|18|-2|-2
86.8255125|23|19|-2|-2
86.818425|21|17|-2|-2
86.7994125|20|16|-2|-2
86.7724|19|15|-1|-1
86.76275|23|19|-3|-3
86.7576875|18|14|-1|-1
86.734175|21|17|-1|-1
86.732175|22|18|-3|-3
86.7238125|22|18|-1|-1
```

## Weighted

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-74.761775,pe.aacc-86.511525 from (select avg(accuracy) as aacc,k,s,l,u from results where paired = 0 and readlen=75 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=75 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
74.947225|20|16|-2|-2|86.7994125|20|16|-2|-2|84.428975|0.185450000000003|0.287887499999997  # ***** Picked *****
74.988925|19|15|-1|-1|86.7724|19|15|-1|-1|84.415705|0.227150000000009|0.260874999999999
75.0243|23|19|-3|-3|86.76275|23|19|-3|-3|84.41506|0.262525000000011|0.251224999999991
74.9946|18|14|-1|-1|86.7576875|18|14|-1|-1|84.40507|0.232825000000005|0.246162499999997
75.003925|22|18|-3|-3|86.732175|22|18|-3|-3|84.386525|0.242150000000009|0.220649999999992
74.988875|21|17|-3|-3|86.6811875|21|17|-3|-3|84.342725|0.227099999999993|0.169662499999987
74.8734|18|14|-1|0|86.6850375|18|14|-1|0|84.32271|0.111625000000004|0.173512500000001
74.936225|20|16|-3|-3|86.643825|20|16|-3|-3|84.302305|0.174449999999993|0.132299999999987
74.882975|22|18|-3|-2|86.6519875|22|18|-3|-2|84.298185|0.121200000000002|0.140462499999984
74.891725|21|17|-3|-2|86.6393125|21|17|-3|-2|84.289795|0.129949999999994|0.127787499999997
```

# Read length 100

Current settings: (20, 16, -2, 2)

## SE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 100 and k=20 and s=16 and l=-2 and u=2;
79.943775|20|16|-2|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 100 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 10;
80.0287|23|19|-2|-2
80.020925|19|15|0|0
80.014925|18|14|0|0
80.0136|20|16|-1|-1  *****
80.012825|22|18|-2|-2
79.974325|21|17|-2|-1
79.9687|19|15|-1|3
79.943775|20|16|-2|2
```

## PE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 100 and k=20 and s=16 and l=-2 and u=2;
89.4313375|20|16|-2|2
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 100 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
89.615525|21|17|0|0
89.61525|18|14|1|1
89.61365|18|14|1|2
89.6109875|19|15|1|1
89.606725|20|16|0|0
89.6066|23|19|-1|-1
89.60435|22|18|0|0
89.595125|22|18|-1|-1
89.589425|19|15|1|2
89.584075|21|17|-1|-1
89.5818625|20|16|0|1
89.5691|21|17|0|1
89.5678875|20|16|-1|-1  # ***** Picked *****
89.562025|22|18|-1|0
89.5579875|23|19|-1|0
89.552375|21|17|-1|0
89.5458875|20|16|-1|0
89.5421|19|15|0|1
89.5412125|19|15|0|0
89.5344125|23|19|-2|-1
```

## Weighted

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-79.943775,pe.aacc-89.4313375 from (select avg(accuracy) as aacc,k,s,l,u from results where paired = 0 and readlen=100 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=100 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
80.0136|20|16|-1|-1|89.5678875|20|16|-1|-1|87.65703|0.0698249999999945|0.13655  # ***** Picked *****
80.020925|19|15|0|0|89.5412125|19|15|0|0|87.637155|0.0771500000000032|0.109875000000002
80.0287|23|19|-2|-2|89.5259125|23|19|-2|-2|87.62647|0.0849249999999984|0.0945750000000061
80.014925|18|14|0|0|89.523325|18|14|0|0|87.621645|0.0711500000000029|0.0919875000000019
80.012825|22|18|-2|-2|89.513425|22|18|-2|-2|87.613305|0.0690500000000043|0.0820875000000143
79.974325|21|17|-2|-1|89.501075|21|17|-2|-1|87.595725|0.030549999999991|0.0697375000000164
79.9687|19|15|-1|3|89.4315125|19|15|-1|3|87.53895|0.0249249999999961|0.000174999999998704
79.943775|20|16|-2|2|89.4313375|20|16|-2|2|87.533825|-1.4210854715202e-14|0.0
```

# Read length 150

Current settings: (20, 16, 1, 7). No other parameter combination was better.

# Read length 200

Current settings: (20, 16, 4, 13)

## SE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen = 200 and k=20 and s=16 and l=4 and u=13;
89.444775|20|16|4|13
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen=200 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
89.485525|22|18|3|14
89.4725|22|18|4|14
89.4689|23|19|3|12
89.4461|22|18|2|12
89.444775|20|16|4|13
```

## PE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen = 200 and k=20 and s=16 and l=4 and u=13;
92.9984375|20|16|4|13
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen=200 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
93.0187875|22|18|4|14
93.017725|23|19|3|12
93.0122875|22|18|3|14
93.0078625|21|17|3|8
92.9984375|20|16|4|13
92.9683125|20|16|1|7
```

## Weighted

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-89.444775,pe.aacc-92.9984375 from (select avg(accuracy) as aacc,k,s,l,u from results where paired=0 and readlen=200 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=200 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
89.4725|22|18|4|14|93.0187875|22|18|4|14|92.30953|0.0277249999999896|0.0203500000000076
89.4689|23|19|3|12|93.017725|23|19|3|12|92.30796|0.0241249999999837|0.0192875000000186
89.485525|22|18|3|14|93.0122875|22|18|3|14|92.306935|0.0407499999999885|0.013850000000005
89.444775|20|16|4|13|92.9984375|20|16|4|13|92.287705|-1.4210854715202e-14|1.4210854715202e-14
```

# Read length 300

Current settings: (22, 18, 2, 12)

## SE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen=300 and k=22 and s=18 and l=2 and u=12;
92.43395|22|18|2|12
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen=300 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
92.476375|23|19|6|12
92.4674|22|18|3|13
92.457725|22|18|4|12
92.455375|21|17|6|13
92.454275|23|19|2|13
92.45265|22|18|2|13
92.44685|23|19|2|12
92.44655|23|19|3|11
92.43395|22|18|2|12
```

## PE

Baseline:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen=300 and k=22 and s=18 and l=2 and u=12;
95.502225|22|18|2|12
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen=300 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
95.522425|23|19|6|12
95.5148|21|17|6|13
95.5148|22|18|3|13
95.513825|23|19|2|13
95.511575|22|18|4|12
95.51055|22|18|2|13
95.502225|22|18|2|12
```

## Weighted

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-92.43395,pe.aacc-95.502225 from (select avg(accuracy) as aacc,k,s,l,u from results where paired = 0 and readlen=300 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=300 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
92.476375|23|19|6|12|95.522425|23|19|6|12|94.913215|0.0424249999999944|0.0202000000000169  # **** Picked ****
92.4674|22|18|3|13|95.5148|22|18|3|13|94.90532|0.0334500000000162|0.0125749999999982  # **** Picked ****
92.455375|21|17|6|13|95.5148|21|17|6|13|94.902915|0.0214250000000078|0.0125750000000124
92.454275|23|19|2|13|95.513825|23|19|2|13|94.901915|0.0203249999999997|0.0116000000000014
92.457725|22|18|4|12|95.511575|22|18|4|12|94.900805|0.0237750000000148|0.00934999999999775
92.45265|22|18|2|13|95.51055|22|18|2|13|94.89897|0.0187000000000097|0.00832499999999925
92.43395|22|18|2|12|95.502225|22|18|2|12|94.88857|1.4210854715202e-14|0.0
```

# Read length 500

## SE

Baseline:
```
select count(*),avg(accuracy),k,s,l,u from results where paired=0 and readlen = 500 and k=23 and s=17 and l=2 and u=12;
4|94.630525|23|17|2|12
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=0 and readlen=500 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
94.679975|23|17|5|12
94.670225|23|17|5|11
94.666525|23|17|4|12
94.654925|23|17|5|10
94.64695|22|16|9|12
94.644725|22|16|4|11
94.630525|23|17|2|12
```

## PE

Baseline:
```
select count(*),avg(accuracy),k,s,l,u from results where paired=1 and readlen = 500 and k=23 and s=17 and l=2 and u=12;
4|95.968975|23|17|2|12
```
Optimized:
```
select avg(accuracy),k,s,l,u from results where paired=1 and readlen=500 group by k, s, l, u having count(*)=4 order by avg(accuracy) desc limit 20;
95.9844375|23|17|5|11
95.98045|23|17|5|12
95.9749|23|17|4|12
95.968975|23|17|2|12
```

## Weighted

```
select *,(pe.aacc+0.25*se.aacc)/1.25,se.aacc-94.630525,pe.aacc-95.968975 from (select avg(accuracy) as aacc,k,s,l,u from results where paired=0 and readlen=500 group by k,s,l,u having count(genome) = 4) as se join (select avg(accuracy) as aacc,k,s,l,u from results where paired=1 and readlen=500 group by k,s,l,u having count(genome) = 4) as pe on se.k = pe.k and se.s = pe.s and se.l = pe.l and se.u = pe.u order by pe.aacc + 0.25 * se.aacc desc limit 10;
94.670225|23|17|5|11|95.9844375|23|17|5|11|95.721595|0.0397000000000105|0.0154624999999839
94.679975|23|17|5|12|95.98045|23|17|5|12|95.720355|0.0494499999999931|0.0114750000000043
94.666525|23|17|4|12|95.9749|23|17|4|12|95.713225|0.0360000000000014|0.00592500000000484
94.630525|23|17|2|12|95.968975|23|17|2|12|95.701285|0.0|0.0
```
ksahlin commented 9 months ago

Really great that you are performing this analysis!

I am happy to change default settings for 50, 75, 100.

For longer read lengths, I think we need to consider possible speed tradeoffs. While a +0.020 improvement is nice, for example, it would perhaps not be worth it if runtime increased by more than 10%.

As for the points 1-4:

  1. Ok, sounds good.
  2. I would prefer the 300* option to allow smaller seeds - this matters in SV regions. In general we should be careful optimizing towards only longer seeds, because SVs will require some short seeds.
  3. Ok.
  4. Ok.
marcelm commented 9 months ago

All the runs have finished and I have updated the table to include results for read lengths 200 and 500. Even though the improvements for the larger read lengths are small (and we may not want to make these changes), I’m glad to have the numbers: we now know that the current parameters are already quite good.

Regarding canonical read length 125: I think that if you are fine with changing the parameters for 100, we should also change them for 125. If I remember correctly, we set these to be an interpolation between 100 and 150, and we should just re-do the interpolation.

Here’s how the suggested changes impact rescue counts and similar.

Read length 200

Here is how rescue etc. numbers change for CHM13:

| What | Current (20, 16, 4, 13) | Suggested (22, 18, 4, 14) |
|---|---|---|
| Total mapping sites tried | 1804922 | 1788317 |
| Total calls to ssw | 194030 | 184665 |
| Inconsistent NAM ends | 3407 | 2966 |
| Tried NAM rescue | 87835 | 80766 |

The picture is the same for the other three datasets. Runtime doesn’t change measurably.

Read length 300

Almost exactly the same thing happens as for read length 200: mapping sites tried, SSW calls and NAM rescues all go down; only the number of inconsistent NAM ends goes up.

CHM13 in detail:

| What | Current (22, 18, 2, 12) | Suggested (22, 18, 3, 13) |
|---|---|---|
| Total mapping sites tried | 1784220 | 1778315 |
| Total calls to ssw | 218392 | 212496 |
| Inconsistent NAM ends | 1876 | 1898 |
| Tried NAM rescue | 80865 | 80351 |

Again, no measurable runtime difference.

Read length 500

Same thing for all four datasets.

ksahlin commented 9 months ago

Very good. I agree with most of your updates, but I wanted to ask about some specific things before merging.

Having (some) smaller seeds will help supplementary alignments (split alignments); thus, keeping/reducing l is of some interest. If you have the measurements easily available/stored:

  1. Read length 100: how does (20, 16, -2, -1) compare to baseline and to current best? If comparable, I would opt for (20, 16, -2, -1) to keep lower bound.
  2. Read length 125: perhaps interpolate to (20, 16, -1, 3)?
  3. Read length 400: perhaps worth keeping l, so (23, 17, 2, 11).
  4. I didn't see an updated table with measurements for 200 and 500 included.

Edit: I found 200 and 500 (canonical classes).

marcelm commented 9 months ago

> Having (some) smaller seeds will help supplementary alignments (split alignments); thus, keeping/reducing l is of some interest. If you have the measurements easily available/stored:

> 2. Read length 125: perhaps interpolate to (20, 16, -1, 3)?

Sure, updated.

> 3. Read length 400: perhaps worth keeping l, so (23, 17, 2, 11).

For this parameter combination, PE drosophila accuracy was lower than baseline (dropped from 95.6773 to 95.66765), which is why it was excluded from further measurements. (That is, I have no numbers for CHM13, maize, rye.)

These are the only options that do not reduce accuracy for at least one of the datasets:

| k | s | l | u | SE | PE | SE diff | PE diff |
|---|---|---|---|---|---|---|---|
| 23 | 17 | 5 | 11 | 94.670225 | 95.9844375 | +0.0397 | +0.015462 |
| 23 | 17 | 5 | 12 | 94.679975 | 95.98045 | +0.0494 | +0.011475 |
| 23 | 17 | 4 | 12 | 94.666525 | 95.9749 | +0.0360 | +0.005925 |
| 23 | 17 | 2 | 12 | 94.630525 | 95.968975 | 0.0 | 0.0 |
> 1. Read length 100: how does (20, 16, -2, -1) compare to baseline and to current best? If comparable, I would opt for (20, 16, -2, -1) to keep the lower bound.

This parameter combination was excluded because maize SE accuracy dropped from 70.5148 to 70.438.

> 4. I didn't see an updated table with measurements for 200 and 500 included.

I meant that I measured actual read lengths 200 and 500, but put them into the rows for canonical read length 250 and 400.

Judging from your comments, I see that you are a bit worried about overfitting, and I totally agree. I think the updated settings for read lengths 50 and 75 should go in, but we can also just leave the rest alone. The true optimization criterion is not accuracy, but something else that is not reflected in our test data. Until it is, we need to rely on your intuition.

ksahlin commented 9 months ago

Yes, I am a bit worried about overfitting, but most parameter changes in the table would probably not affect supplementary alignments, and some would even be beneficial (50-125).

Based on your comments, and until we look into supplementary alignments, I suggest implementing the parameter changes you propose in your summary table for all read-length classes except canonical class 400. The proposed change for the 400 class is the only parameter change with a larger increase in l.

Do you agree?

marcelm commented 9 months ago

Agreed! Will update the PR appropriately and merge it later.

marcelm commented 9 months ago

With this merged, I think it’s time for a release. Do you want to run your evaluations on commit 9b95973fbe1766f89fdbf5a3aacdb4435304d987?

ksahlin commented 9 months ago

Great, will do. I'll get back to you about when. Hopefully tonight or tomorrow.

ksahlin commented 9 months ago

I pulled and compiled https://github.com/ksahlin/strobealign/commit/9b95973fbe1766f89fdbf5a3aacdb4435304d987 and ran it on the larger simulated datasets (called v0.12.0 in plots). Plots attached. I am a bit puzzled by the results. In summary:

Given this, it seems like we cannot integrate any of the parameters until we know what is going on. The accuracy was computed with my original scripts (not jaccard).

I double- and triple-checked that I ran the right commit:

```
$ strobealign-9b95973
  strobealign reference [reads1] [reads2] {OPTIONS}

    strobelign 0.11.0-103-g9b95973
```

accuracy_plot.pdf accuracy_plot_zoomed.pdf time_plot.pdf memory_plot.pdf percentage_aligned_plot.pdf

(All datasets in this analysis are PE reads)

ksahlin commented 9 months ago

What seems particularly strange is that the parameters for 150 bp reads did not change in the commit, as far as I understand. I therefore wonder whether the decrease could be caused by some other commit since v0.11.0? I am not sure which baseline commit you are benchmarking against.

Also, just mentioning that I simulate reads with mason using the following variant frequencies w.r.t. the references: `mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50`. How did you simulate your reads? This would not explain the discrepancy for the 150 bp dataset, though.

marcelm commented 9 months ago

Yes, I noticed as well that some of the changes cannot be caused by the new parameters. I’m looking at old commits to find out what is going on.

I did not simulate reads myself but used the first 1 M reads of your datasets.

marcelm commented 9 months ago

I’m looking at the memory consumption at the moment. What are the absolute values for rye, read length 50? For me, memory usage for 0.11.0 is 35.3 GB and 36.1 GB for main. So there’s a slight increase, but not that much (I’ll try to find out which commit caused this). Is it possible that the plot doesn’t show results for 0.11.0, but for some commit before the change to 64 bit randstrobe indices? That did increase memory usage from 33 GB to 35.3 GB.

ksahlin commented 9 months ago

Yes, you are right. In the plot, the label v0.11.0_cmake corresponds to strobealign-e0764b6 (https://github.com/ksahlin/strobealign/commit/e0764b6a24f05f9b6d0853286af93080a3b0452b). The commit message says that we had already changed to 64 bits at that point.

ksahlin commented 9 months ago

Below I have pasted the exact values:

```
tool,dataset,read_length,time,memory
strobealign_v071,rye,50,646.7549999999999,50.286916
strobealign_v071_map,rye,50,443.653,50.286916
strobealign_v080_conda,rye,50,696.6300000000001,48.79224
strobealign_v080_cmake,rye,50,862.1099999999999,48.792032
strobealign_v080_cmake_map,rye,50,697.26,48.791976
strobealign_v090_cmake,rye,50,824.16,48.792368
strobealign_v0110_cmake,rye,50,734.5899999999999,33.727584
strobealign_v0110_cmake_map,rye,50,564.01,33.727556
strobealign_v0120_cmake,rye,50,780.1,36.969792
strobealign_v0120_cmake_map,rye,50,621.8399999999999,36.969716
```

These are the peak memory values (RSS) reported during alignment with `/usr/bin/time -v`.

ksahlin commented 9 months ago

> That did increase memory usage from 33 GB to 35.3 GB.

Otherwise my measurements seem to agree well with what you see.

I have also posted the CHM13 50 nt values below:

```
strobealign_v071,CHM13,50,205.613,31.445908
strobealign_v071_map,CHM13,50,143.928,31.086756
strobealign_v080_conda,CHM13,50,209.55,23.96776
strobealign_v080_cmake,CHM13,50,224.95000000000002,23.967372
strobealign_v080_cmake_map,CHM13,50,185.74,23.967352
strobealign_v090_cmake,CHM13,50,246.75000000000003,23.967708
strobealign_v0110_cmake,CHM13,50,219.26000000000002,14.550108
strobealign_v0110_cmake_map,CHM13,50,170.55999999999997,14.393092
strobealign_v0120_cmake,CHM13,50,243.62,15.284884
strobealign_v0120_cmake_map,CHM13,50,202.77,15.241692
```
marcelm commented 9 months ago

Did you change the number of threads?

ksahlin commented 9 months ago

Nope, I am running `/usr/bin/time -v strobealign-9b95973 -t 16 -o {tmp_sam} {input.ref} {tmp_fq_L} {tmp_fq_R}` as with all previous commits.

Details are here: https://github.com/ksahlin/alignment_evaluation/blob/master/evaluation_genomes/Snakefile

marcelm commented 9 months ago

I found multiple commits that change memory usage for rye.

| Memory usage (GiB) | Commit | What |
|---|---|---|
| 30.2 | | Baseline (new index structure, before switching to 64 bit bucket indices) |
| 32.1 | bdf4012 | Switch to 64 bit bucket indices |
| 33.3 | 6c67bfc | Use xxh64 to hash the s-mer |
| 35.2 | de58725 | Compute strobemers in parallel |

Commit 6c67bfc (hashing s-mers) needs more RAM because more strobemers are produced. Before:

```
Estimated total memory usage: 33.67 GB
Total strobemers:        1380324847
```

After 6c67bfc:

```
Total strobemers:        1454354777
Estimated total memory usage: 34.85 GB
```

(I was confused by the estimate being too high, but I now realize that the estimate uses 1 GB = $10^9$ bytes while the measurements use 1 GiB = $2^{30}$ bytes.)
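The two units differ by roughly 7%, which is enough to explain such a gap. Converting the estimate above from decimal GB (assuming $10^9$ bytes) into binary GiB ($2^{30}$ bytes) looks like this:

```python
# Convert the index memory estimate from decimal GB (10**9 bytes)
# to binary GiB (2**30 bytes), the unit the measurements use.
est_gb = 34.85                 # "Estimated total memory usage: 34.85 GB"
est_bytes = est_gb * 10**9
est_gib = est_bytes / 2**30
print(f"{est_gb} GB is about {est_gib:.2f} GiB")
```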

I’ll need to investigate what happens in de58725.

marcelm commented 9 months ago

Note for later: The decrease in accuracy is due to commit 91f2223, likely due to nam_ids being re-used (and colliding). Commit 4ac1e1b bumps accuracy back to the previous level, need to turn this into a proper fix.

ksahlin commented 9 months ago

> Note for later: The decrease in accuracy is due to commit https://github.com/ksahlin/strobealign/commit/91f22234baf192a3167354a7f2bbfefb49cb2162, likely due to nam_ids being re-used (and colliding). Commit https://github.com/ksahlin/strobealign/commit/4ac1e1b8253d2c756ba59714301d97368635f401 bumps accuracy back to the previous level, need to turn this into a proper fix.

I see. That’s good news. I was worried it was going to be https://github.com/ksahlin/strobealign/commit/6c67bfca9e4898064ac7ad907b31bcfee2abb85a, which would be discouraging since, theoretically, hashing s-mers should be the right thing to do.

marcelm commented 9 months ago

As you saw, I submitted #350 to fix the accuracy issue.

The only remaining thing is the decreased mapping rate. This is caused by commit 0239650, which is the commit that applies the tuned parameters.

In my opinion, the decrease in mapping rate is a good thing since the reads that are now no longer mapped were incorrect anyway.

The accuracy is currently the percentage of all reads that were mapped correctly. With this, the aim should be to get the mapping rate close to the accuracy.

If we changed it such that accuracy were the percentage of mapped reads that were mapped correctly, which would also make sense, then accuracy would go up even more with the tuned parameters.
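To illustrate the difference between the two definitions, here is a toy calculation with made-up counts (not taken from the measurements in this thread):

```python
# Made-up counts, purely illustrative.
total_reads = 1_000_000   # all simulated reads
mapped = 960_000          # reads the aligner reported as mapped
correct = 930_000         # mapped reads placed at the true position

acc_of_all = correct / total_reads   # current definition: correct / all reads
acc_of_mapped = correct / mapped     # alternative: correct / mapped reads
print(f"accuracy over all reads:    {acc_of_all:.2%}")
print(f"accuracy over mapped reads: {acc_of_mapped:.2%}")

# If tuning drops 10,000 reads that were mapped incorrectly anyway,
# the first definition is unchanged while the second one improves.
mapped_after = mapped - 10_000
print(f"after tuning:               {correct / mapped_after:.2%}")
```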

I guess what I’m saying is: We should be glad precision goes up, but the plots don’t make that clear.

(Hm, thinking about whether the precision/recall terminology is appropriate here, it appears to me we don’t have any true negatives, but maybe that’s not necessary.)

I also wonder whether it makes sense to somehow evaluate multimappers differently when computing accuracy. A read that maps equally well to two or more places should still be counted as somewhat correct when it’s assigned to one of those positions, even if that wasn’t where it came from.
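One way to score multimappers along these lines, sketched here as a hypothetical helper (not strobealign or evaluation-script code; the 0.5 partial weight and the 20 bp tolerance are arbitrary assumptions):

```python
def multimap_credit(reported_pos, true_pos, equally_good_positions, tol=20):
    """Hypothetical scoring: full credit at the simulated origin,
    partial credit at any other equally well-matching position."""
    if abs(reported_pos - true_pos) <= tol:
        return 1.0  # mapped to where the read was simulated from
    if any(abs(reported_pos - p) <= tol for p in equally_good_positions):
        return 0.5  # hit an equally good copy; arbitrary partial weight
    return 0.0      # mapped somewhere else entirely

# A read simulated from position 1000, with a perfect repeat copy at 5000:
print(multimap_credit(1003, 1000, [5000]))  # full credit
print(multimap_credit(4998, 1000, [5000]))  # partial credit
print(multimap_credit(9000, 1000, [5000]))  # no credit
```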

ksahlin commented 9 months ago

Interesting phenomenon. I had not anticipated this. While high precision is good, here is my take:

While I generally agree that accuracy (as measured by tot_correct_mapped/tot_reads) is the metric to primarily optimise for, there are situations (perfect repeats) where more mapped reads are beneficial, e.g. for coverage-based applications. While I don't have data to support this, it is likely that most of the loss in mapped reads was in repetitive regions (including perfect repeats).

It could be that much smaller seeds lead to many hits going beyond the hard threshold (1000?) and simply becoming unmapped (strobealign gives up). While it is okay to take some loss in percentage mapped, it is worth monitoring. The drop we saw is probably acceptable (borderline) given the increase in accuracy.

I am not sure why the decline is so big even for 100nt reads. I will try the next commit (after your optimization) and see the final damage in percentage mapped at that point.

So the accuracy bug does not affect the percentage mapped reads? (I can probably answer this myself in my next evaluation pass after your optimization)

marcelm commented 9 months ago

> So the accuracy bug does not affect the percentage mapped reads? (I can probably answer this myself in my next evaluation pass after your optimization)

The percentage of mapped reads was not affected by the accuracy bug (NAM id collisions). It changed only because of the updated parameter settings.

ksahlin commented 9 months ago

I took the liberty of running commit 1587954 (current master). New plots attached. The memory bug seems fixed, but note that usage is still slightly higher on rye compared to commit https://github.com/ksahlin/strobealign/commit/e0764b6a24f05f9b6d0853286af93080a3b0452b. Also, why don't I get the same absolute memory numbers as you? Could it have something to do with me running 16 threads?

Accuracy is better for 50 nt PE reads, slightly better for 75, and nearly unchanged for the remaining read lengths (a tad worse for 500 nt). Quite a hit in the number of unaligned reads as well, as we discussed in https://github.com/ksahlin/strobealign/issues/354. Worth investigating further.

accuracy_plot.pdf accuracy_plot_zoomed.pdf time_plot.pdf percentage_aligned_plot.pdf memory_plot.pdf