Closed marcelm closed 9 months ago
Great that you found it and fixed it so quickly!
Yes, a bit painful to rerun but exciting to see what the new optimization will arrive at. I guess the numbers reported above are close to minimum improvement we could arrive at (if you use 0.25SE + 0.75PE as objective function).
Approved btw. If we want to take the new accuracy numbers discussion somewhere else.
We can take the discussion here.
Here are the new numbers for read length 50, 75, 100 and 150. For them, the same parameter settings as before maximize accuracy.
For canonical read length 250, the best parameters are now (23, 19, 4, 14) (marked 250* in the table below).
I copied the table from #345 and updated it. The single-end numbers don’t change. I haven’t had time to understand why that is the case.
Canonical read length | Current parameters | suggested parameters | SE improvement | PE improvement |
---|---|---|---|---|
50 | (20, 16, -3, 2) | (18, 14, -2, -2) | +2.478 | |
75 | (20, 16, -3, 2) | (20, 16, -2, -2) | +0.185 | |
100 | (20, 16, -2, 2) | (20, 16, -1, -1) | +0.070 | |
150 | (20, 16, 1, 7) | (20, 16, 1, 7) | already best | |
250 | (20, 16, 4, 13) | (22, 18, 4, 14) | +0.028 | |
250* | (23, 19, 4, 14) | +0.049 | +0.025 | |
300 | (22, 18, 2, 12) | (22, 18, 3, 13) | +0.033 |
For read length 300, there’s a whole bunch of better parameter settings. The best is (24, 20, 6, 13), but this increases $l$ quite a bit. I’ll just dump the list here for now:
k | s | l | u | SE improvement | PE improvement | comment |
---|---|---|---|---|---|---|
24 | 20 | 6 | 13 | +0.055 | +0.036 | |
23 | 19 | 8 | 13 | +0.043 | +0.032 | |
23 | 19 | 5 | 13 | +0.055 | +0.028 | |
24 | 20 | 4 | 13 | +0.054 | +0.028 | |
23 | 19 | 6 | 12 | +0.042 | +0.031 | |
23 | 19 | 6 | 13 | +0.053 | +0.025 | |
22 | 18 | 6 | 12 | +0.038 | +0.029 | |
23 | 19 | 4 | 12 | +0.036 | +0.028 | |
22 | 18 | 6 | 13 | +0.046 | +0.025 | |
24 | 20 | 2 | 13 | +0.031 | +0.028 | |
23 | 19 | 6 | 11 | +0.027 | +0.027 | |
24 | 20 | 4 | 12 | +0.034 | +0.025 | |
23 | 19 | 5 | 11 | +0.024 | +0.024 | |
23 | 19 | 7 | 11 | +0.018 | +0.023 | |
22 | 18 | 3 | 13 | +0.033 | +0.018 | chosen in #345 |
21 | 17 | 6 | 13 | +0.021 | +0.02 | |
23 | 19 | 2 | 13 | +0.02 | +0.02 | |
21 | 17 | 7 | 13 | +0.022 | +0.019 | |
23 | 19 | 3 | 13 | +0.038 | +0.014 | |
22 | 18 | 5 | 13 | +0.04 | +0.012 | |
22 | 18 | 4 | 12 | +0.024 | +0.015 | |
23 | 19 | 4 | 11 | +0.026 | +0.014 | |
22 | 18 | 4 | 13 | +0.032 | +0.011 | |
23 | 19 | 3 | 12 | +0.017 | +0.012 | |
22 | 18 | 2 | 13 | +0.019 | +0.008 | |
22 | 18 | 2 | 12 | 0.0 | 0.0 | current (baseline) |
All of these numbers are relatively small, so maybe it doesn’t matter too much.
I’ve updated my previous comment. It now reports numbers for all read length except 400 (which is still running and which you didn’t want to change anyway).
Okay, I don't have any idea why results would be the same for SE either. I am okay with the suggestions above, but remain skeptical at increasing l
too much for the read lengths tested (for longer reads it might be ok). I think the current suggested 300 setting is okay.
For another discussion: I wanted to check if you ever tested different q
? I was thinking if we should remove the whole masking and Bitcount() it since it uses popcount which seems expensive (I can show you some results from another study offline). q
and Bitcount()
were initially introduced to skew the sampling towards shorter seeds, which positively affected the shortest reads (=< 100ish), but with above settings, read lengths 50-100 does not even need this computation.
Let me know when you have a new commit to test.
Okay, I don't have any idea why results would be the same for SE either. I am okay with the suggestions above, but remain skeptical at increasing
l
too much for the read lengths tested (for longer reads it might be ok). I think the current suggested 300 setting is okay.
Can you confirm that parameter settings for canonical read length 300 should remain (22, 18, 3, 13) as decided upon in #345? (Just asking because #345 is already merged, which makes "current suggested" ambiguous.)
For another discussion: I wanted to check if you ever tested different
q
?
No, can you open a new issue for that?
Can you confirm that parameter settings for canonical read length 300 should remain (22, 18, 3, 13) as decided upon in https://github.com/ksahlin/strobealign/pull/345?
Yes, I confirm that (22, 18, 3, 13)
is good for 300.
No, can you open a new issue for that?
Will do.
The nam_id started from 0 both for forward and reverse hits, which resulted in NAM ids being reused for different NAMs. This regression was introduced by commit 91f2223 and reduced accuracy.
This PR brings accuracy back to the expected levels (that is, higher than v0.11.0 due to tuned parameters).
I will need to re-run the parameter evaluation ...
See #345
Comparison
for a couple of datasets