marcelm closed this issue 9 months ago
It turns out good single-end accuracy does not necessarily imply good paired-end accuracy, so here is an optimization round only on paired-end data:
k | s | l | u | accuracy |
---|---|---|---|---|
22 | 18 | -2 | -2 | 86.833475 |
23 | 19 | -2 | -2 | 86.8255125 |
21 | 17 | -2 | -2 | 86.818425 |
20 | 16 | -2 | -2 | 86.7994125 (suggested) |
19 | 15 | -1 | -1 | 86.7724 |
23 | 19 | -3 | -3 | 86.76275 |
18 | 14 | -1 | -1 | 86.7576875 |
21 | 17 | -1 | -1 | 86.734175 |
22 | 18 | -3 | -3 | 86.732175 |
22 | 18 | -1 | -1 | 86.7238125 |
20 | 16 | -3 | 2 | 86.511525 (baseline) |
Now the question is which parameter combination to pick because it is going to be a tradeoff between SE and PE accuracy. (We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).)
Because PE accuracy is maybe more important than SE, the following table ranks the parameter combinations by PE accuracy plus 25% of SE accuracy:
k | s | l | u | PE | SE |
---|---|---|---|---|---|
20 | 16 | -2 | -2 | 86.7994125 | 74.947225 |
19 | 15 | -1 | -1 | 86.7724 | 74.988925 |
23 | 19 | -3 | -3 | 86.76275 | 75.0243 |
18 | 14 | -1 | -1 | 86.7576875 | 74.9946 |
22 | 18 | -3 | -3 | 86.732175 | 75.003925 |
21 | 17 | -3 | -3 | 86.6811875 | 74.988875 |
18 | 14 | -1 | 0 | 86.6850375 | 74.8734 |
20 | 16 | -3 | -3 | 86.643825 | 74.936225 |
22 | 18 | -3 | -2 | 86.6519875 | 74.882975 |
21 | 17 | -3 | -2 | 86.6393125 | 74.891725 |
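As a side note, the combined ranking used for this table can be reproduced in a few lines of Python. This is only a sketch using the first three rows of the table; the 0.25 weight is the one mentioned above.

```python
# Rank candidate parameter sets by PE accuracy plus 25% of SE accuracy.
# Rows are (k, s, l, u, PE, SE), taken from the first three table entries.
candidates = [
    (20, 16, -2, -2, 86.7994125, 74.947225),
    (19, 15, -1, -1, 86.7724, 74.988925),
    (23, 19, -3, -3, 86.76275, 75.0243),
]

def score(row):
    """Combined score: PE accuracy plus 25% of SE accuracy."""
    pe, se = row[4], row[5]
    return pe + 0.25 * se

ranked = sorted(candidates, key=score, reverse=True)
for row in ranked:
    print(row[:4], round(score(row), 4))
```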
Picking $k=20$, $s=16$, $l=u=-2$ would improve SE accuracy by 0.3 and PE accuracy by 0.18. (And it’s similar to what we have now.)
I’m a bit surprised that many of these good parameter combinations have $l=u$. That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?
> We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).
I agree
It could be that (20, 16, -2, -2) invokes more rescue and is therefore slower (not suggesting you do further benchmarks). Given that (20, 16, -2, -2) and (23, 19, -3, -3) give a near-identical increase in accuracy for PE, I would be okay with picking (23, 19, -3, -3) (but no strong preference).
> I’m a bit surprised that many of these good parameter combinations have l=u. That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?
I don't think any of the parameter settings above, except (18, 14, -1, 0), yields 'randstrobes' in the way they were originally intended to be designed (i.e., with a gap in the middle), because an offset of 1, 2, or at most 3 syncmers downstream of the first syncmer rarely reaches beyond k nt downstream. Instead, our parameter settings above generate flexible-sized k-mers.
Also, strobemers were originally intended for longer reads and are not expected to work well for very short reads. In the strobealign paper I argue that 2×150nt reads is where we really start to see a benefit over e.g. k-mer-based tools. What we are seeing in the analysis above (that short exact k-mers work better for <100nt reads) is expected.
> Not sure what this implies. Is this just something with the test data?
Another way to answer this: I think for 50nt and 75nt reads, it helps to have short seeds, simply to pack in as many valid seeds as possible in the hope that at least one matches in case of an error. There isn't really any 'room' to create longer, randomly spaced seeds.
Coming back to this, I would now even have a slight preference for the setting (23, 19, -3, -3). This is because (23, 19, -3, -3) generates flexible k-mers with min size 26 and mean 28, while (20, 16, -2, -2) generates k-mers with min size 26 and mean 30. The larger end of the distribution is also likely to be lower for the first setting because the whole distribution is pushed towards smaller values.
Not having to do as many rescue searches is probably favourable for speed, and biological data is possibly noisier. I should do a full benchmark once we have a candidate commit for release 0.12.0.
Ok, let’s use (23, 19, -3, -3). I’ll measure the no. of rescues and runtime and will add that information when I open the PR.
And because the script for optimizing parameters is now written anyway, I also ran it on read length 100. Here are the best settings in comparison to the current ones:
k | s | l | u | avg. accuracy |
---|---|---|---|---|
20 | 16 | -1 | -1 | 84.79074375 |
19 | 15 | 0 | 0 | 84.78106875 |
23 | 19 | -2 | -2 | 84.77730625 |
18 | 14 | 0 | 0 | 84.769125 |
22 | 18 | -2 | -2 | 84.763125 |
21 | 17 | -2 | -1 | 84.7377 |
20 | 16 | -2 | 2 | 84.68755625 (current) |
Shall I change read length 100 to (23, 19, -2, -2)? SE accuracy would go from 79.943775 to 80.0287, and PE accuracy from 89.4313375 to 89.5259125. It would also make sense to just increment l and u by one compared to the new parameters for read length 75.
And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.
Great, I agree that we can use (23, 19, -2, -2) for the 100nt canonical class.
> And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.
Awesome! Keep in mind that for the longer read lengths, I predict there will be more time-accuracy tradeoffs. Even though we may still see good accuracy with 'flexible-size' k-mers (created through the randstrobes method), we would have to assess the runtime in more detail against longer 'real' randstrobes (with a gap).
The reason I am not too worried about assessing runtime tradeoffs for the very short reads (50 and 75nt) is that longer seeds lead to so much unnecessary rescuing that short seeds may even be beneficial for runtime (i.e., a win-win).
It turns out the number of rescues actually goes up with (23, 19, -3, -3) in three out of four datasets. From the logs for maize paired ends:
(20, 16, -2, -2):

```
Tried NAM rescue: 845713
Mates rescued by alignment: 2741107
Total calls to ssw: 4414486
User time (seconds): 1045.74
```

(23, 19, -3, -3):

```
Tried NAM rescue: 901729
Mates rescued by alignment: 2590267
Total calls to ssw: 4459521
User time (seconds): 1092.19
```
Only in Drosophila are the "NAM rescues" lower with the new setting, but even there, the number of SSW alignments goes up. The runtime measurements cannot be trusted that much, but the trend seems to be that runtime is slightly higher for (23, 19, -3, -3).
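For reference, counters like these can be tabulated from the logs with a small script. This is only a sketch: `parse_log` is a hypothetical helper whose field names simply match the log excerpts quoted above.

```python
import re

# Hypothetical helper that extracts the counters quoted above from a
# strobealign log. The field names are the ones from the log excerpts
# in this thread.
FIELDS = [
    "Tried NAM rescue",
    "Mates rescued by alignment",
    "Total calls to ssw",
    "User time (seconds)",
]

def parse_log(text):
    """Return a dict mapping each known field name to its numeric value."""
    counters = {}
    for field in FIELDS:
        m = re.search(re.escape(field) + r":\s*([\d.]+)", text)
        if m:
            counters[field] = float(m.group(1))
    return counters

log_2016 = """Tried NAM rescue: 845713
Mates rescued by alignment: 2741107
Total calls to ssw: 4414486
User time (seconds): 1045.74"""

stats = parse_log(log_2016)
print(stats["Total calls to ssw"])  # -> 4414486.0
```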
What do you think, do you want to stay with (23, 19, -3, -3)?
Interesting. I guess we should go for (20, 16, -2, -2) then, if it is a bit faster and has slightly better PE accuracy. If you agree?
Just out of curiosity, did you check how much it increases/decreases rescue/runtime over baseline?
In terms of "Total calls to ssw", (20, 16, -2, -2) has slightly fewer than the baseline for rye and CHM13 and slightly more for maize and drosophila (~5%).
I see, so it is difficult to predict any clear trends. Nevertheless, it seems favourable compared to the baseline.
As discussed in #272, it may be appropriate to add 75 as a new canonical read length.
I did an extensive search for a good set of parameters that improves accuracy over the existing one.
Reads of length 75 currently use the same parameters as reads of length 50, which are: $k=20$, $s=16$, $l=-3$, $u=2$. The accuracy we get using these parameters is our baseline.
I tested the following values for the parameters:
The script I wrote only tests single-end data. The assumption is that good single-end performance also means good paired-end performance. The script optimizes accuracy averaged over all four genomes, with one heuristic: It only accepts parameter combinations where the accuracy is better than the baseline for all four genomes. This speeds it up: When it tests a combination, it tests the genomes one by one and can stop as soon as the obtained accuracy is lower than the baseline. By starting with the smallest genome (Drosophila), often the bigger genomes don’t need to be tested at all.
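The early-stopping heuristic can be sketched like this. `run_accuracy` is a hypothetical stand-in for the actual mapping-and-evaluation step, and the genome names are the four used in this thread:

```python
# Sketch of the early-stopping heuristic: genomes are tested smallest
# first, and a parameter combination is abandoned as soon as its
# accuracy drops below the baseline for any genome.
GENOMES = ["drosophila", "maize", "CHM13", "rye"]  # smallest first

def evaluate(params, run_accuracy, baseline):
    """Return the mean accuracy over all genomes, or None if the
    combination is worse than the baseline on any genome."""
    accuracies = []
    for genome in GENOMES:
        acc = run_accuracy(params, genome)
        if acc < baseline[genome]:
            return None  # abandon early; remaining genomes are skipped
        accuracies.append(acc)
    return sum(accuracies) / len(accuracies)

# Usage with fake accuracy numbers, purely for illustration:
baseline = {g: 80.0 for g in GENOMES}
good = evaluate((20, 16, -2, -2), lambda p, g: 81.0, baseline)  # 81.0
bad = evaluate((30, 26, 0, 0), lambda p, g: 79.0, baseline)     # None
```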
The parameter combinations with $s=k-2$ performed really well, but this comes at a cost of increased memory usage, so I report results for both separately.
Here are the top 10 parameter combinations and the baseline for $s=k-4$:
Here are the top 10 parameter combinations if we allow $s=k-2$:
My feeling is that this additional increase is not worth the cost in memory, at least not for the default settings (sampling rate increases from 1/5 to 1/3, i.e., 67% more randstrobes).
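The memory estimate follows from the open-syncmer sampling rate of roughly 1/(k−s+1), which the thread uses for the 1/5 and 1/3 figures:

```python
# Open syncmers sample roughly 1 in (k - s + 1) positions, so going
# from s = k-4 to s = k-2 changes the sampling rate from 1/5 to 1/3.
rate_s_k4 = 1 / (4 + 1)  # s = k-4  ->  rate 1/5
rate_s_k2 = 1 / (2 + 1)  # s = k-2  ->  rate 1/3
increase = rate_s_k2 / rate_s_k4 - 1
print(f"{increase:.0%} more randstrobes")  # -> 67% more randstrobes
```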
I am running the above combinations on paired-end data at the moment and will report the results later.