ksahlin / strobealign

Aligns short reads using dynamic seed size with strobemers
MIT License
128 stars 16 forks source link

A new profile for read length 75 #339

Closed marcelm closed 9 months ago

marcelm commented 9 months ago

As discussed in #272, it may be appropriate to add 75 as a new canonical read length.

I did an extensive search for a good set of parameters that improves accuracy over the existing one.

Reads of lengths 75 currently use the same parameters as reads of lengths 50, which are: $k=20$, $s=16$, $l=-3$, $u=2$. The accuracy we get using these parameters is our baseline.

I tested the following values for the parameters:

parameter values
k 17, ..., 23
s $k-4$, $k-2$ (but see below)
l -4, ..., 2
u $l$, ..., 2

The script I wrote only tests single-end data. The assumption is that good single-end performance also means good paired-end performance. The script optimizes accuracy averaged over all four genomes, with one heuristic: It only accepts parameter combinations where the accuracy is better than the baseline for all four genomes. This speeds it up: When it tests a combination, it tests the genomes one by one and can stop as soon as the obtained accuracy is lower than the baseline. By starting with the smallest genome (Drosophila), often the bigger genomes don’t need to be tested at all.

The parameter combinations with $s=k-2$ performed really well, but this comes at a cost of increased memory usage, so I report results for both separately.

Here are the top 10 parameter combinations and the baseline for $s=k-4$:

k s l u avg. accuracy diff to baseline
23 19 -3 -3 75.0243 +0.262525
22 18 -3 -3 75.003925 +0.24215
18 14 -1 -1 74.9946 +0.232825
19 15 -1 -1 74.988925 +0.22715
21 17 -3 -3 74.988875 +0.22710
17 13 -1 -1 74.976225 +0.21445
20 16 -2 -2 74.947225 +0.18545
20 16 -3 -3 74.936225 +0.17445
21 17 -3 -2 74.891725 +0.12995
20 16 -3 -2 74.88965 +0.127875
20 16 -3 2 74.761775 0 (baseline)

Here are the top 10 parameter combinations if we allow $s=k-2$:

k s l u avg. accuracy diff to baseline
18 16 -3 -3 75.236575 +0.4748
19 17 -3 -3 75.2342 +0.472425
20 18 -3 -3 75.233575 +0.4718
17 15 -2 -1 75.225825 +0.46405
20 18 -4 -3 75.2253 +0.463525
19 17 -3 -2 75.22265 +0.460875
17 15 -1 -1 75.220625 +0.45885
18 16 -2 -2 75.21785 +0.456075
18 16 -3 -2 75.21635 +0.454575
19 17 -4 -3 75.213925 +0.45215

My feeling is that this additional increase is not worth the cost in memory, at least not for the default settings (sampling rate increases from 1/5 to 1/3, i.e., 67% more randstrobes).

I am running the above combinations on paired-end data at the moment and will report the results later.

marcelm commented 9 months ago

It turns out good single-end accuracy does not necessarily imply good paired-end accuracy, so here is an optimization round only on paired-end data:

k s l u accuracy
22 18 -2 -2 86.833475
23 19 -2 -2 86.8255125
21 17 -2 -2 86.818425
20 16 -2 -2 86.7994125 (suggested)
19 15 -1 -1 86.7724
23 19 -3 -3 86.76275
18 14 -1 -1 86.7576875
21 17 -1 -1 86.734175
22 18 -3 -3 86.732175
22 18 -1 -1 86.7238125
20 16 -3 2 86.511525 (baseline)

Now the question is which parameter combination to pick because it is going to be a tradeoff between SE and PE accuracy. (We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).)

Because PE accuracy is maybe more important than SE, the following table ranks the parameter combinations by PE accuracy plus 25% of SE accuracy:

k s l u PE SE
20 16 -2 -2 86.7994125 74.947225
19 15 -1 -1 86.7724 74.988925
23 19 -3 -3 86.76275 75.0243
18 14 -1 -1 86.7576875 74.9946
22 18 -3 -3 86.732175 75.003925
21 17 -3 -3 86.6811875 74.988875
18 14 -1 0 86.6850375 74.8734
20 16 -3 -3 86.643825 74.936225
22 18 -3 -2 86.6519875 74.882975
21 17 -3 -2 86.6393125 74.891725

Picking $k=20$, $s=16$, $l=u=-2$ would improve SE accuracy by 0.3 and PE accuracy by 0.18. (And it’s similar to what we have now.)

marcelm commented 9 months ago

I’m bit suprised many of these good parameter combinations have $l=u$. That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?

ksahlin commented 9 months ago

We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).

I agree

It could be that (20,16,-2,-2) invokes more rescue and is therefore slower (not suggesting you to do further benchmarks). Given that (20,16,-2,-2) and (23, 19, -3, -3) gives near identical increase in accuracy for PE, I would be okay picking (23, 19, -3, -3) (but no strong preference).

I’m bit suprised many of these good parameter combinations have l=u That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?

I don't think any of the parameter settings above, except (18, 14, -1, 0), yield 'randstrobes' (in the way they were first intended to be designed, i.e., with a gap in the middle), because the offset of 1, 2,or at most 3 syncmers downstream for the first syncmer rarely make it beyond k nt downstream. Our parameter settings above instead generates flexible sized k-mers.

Also, strobemers were originally intended for longer reads and are not expected to work well for very short reads. In the strobealign paper I argue that 2*150nt reads is where we really start to see benefit over e.g., k-mer based tools. What we are seeing in above analysis (that short exact k-mers are working better for <100nt reads) is expected.

ksahlin commented 9 months ago

Not sure what this implies. Is this just something with the test data?

Another way to answer this: I think for 50nt and 75nt reads, it helps having short seeds, simply to pack in as many valid seeds as possible in the hope that at least one matches in case of an error. There aren't really any 'room' to create longer randomly spaced seeds.

ksahlin commented 9 months ago

Coming back to this, I would now even have slight preference for the setting (23, 19, -3, -3). This is because (23, 19, -3, -3) generates flexible k-mers with min size 26 and mean 28, while (20, 16, -2, -2) generates k-mers with min 26 and mean 30. The larger end of the distribution is also more likely to be lower for the first setting due to the whole distribution is pushed towards smaller values.

Not having to do as many rescue searches is probably favourable in speed and possibly biological data is more noisy. I should do a full benchmark when we have a candidate commit for release 0.12.0.

marcelm commented 9 months ago

Ok, let’s use (23, 19, -3, -3). I’ll measure the no. of rescues and runtime and will add that information when I open the PR.


And because the script for optimizing parameters is now written anyway, I also ran it on read length 100. Here are the best settings in comparison to the current ones:

k s l u avg. accuracy
20 16 -1 -1 84.79074375
19 15 0 0 84.78106875
23 19 -2 -2 84.77730625
18 14 0 0 84.769125
22 18 -2 -2 84.763125
21 17 -2 -1 84.7377
20 16 -2 2 84.68755625 (current)

Shall I change read length 100 to (23, 19, -2, -2)? SE accuracy would go from 79.943775 to 80.0287, and PE accuracy from 89.4313375 to 89.5259125. Would make sense to just increment l and u by one compared to the new parameters for read length 75.

And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.

ksahlin commented 9 months ago

Great, I agree that we can use (23, 19, -2, -2) for the 100nt canoncal class.

And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.

Awesome! Keep in mind that for the longer read lengths (I predict that) there will be more time-accuracy tradeoffs. Even though we may still see good accuracy with 'flexible size' k-mers (through randstrobes method), we would have to assess more in detail the runtime to longer 'real' randstrobes (with gap).

The reason I am not to worried assessing runtime tradeoffs for the very short reads (50 and 75nt) is because longer seeds leads to so much unnecessary rescuing that the short seeds may even be beneficial for runtime (i.e. a win-win).

marcelm commented 9 months ago

It turns out the number of rescues actually goes up with (23, 19, -3, -3) in three out of four datasets. From the logs for maize paired ends:

(20, 16, -2, -2):

Tried NAM rescue: 845713
Mates rescued by alignment: 2741107
Total calls to ssw: 4414486
User time (seconds): 1045.74

(23, 19, -3, -3):

Tried NAM rescue: 901729
Mates rescued by alignment: 2590267
Total calls to ssw: 4459521
User time (seconds): 1092.19

Only in Drosophila, the "NAM rescues" are lower with the new setting, but even there, the number of SSW alignments goes up. Runtime cannot be trusted that much, but the trend seems to be that it is slightly higher for (23, 19, -3, -3).

What do you think, do you want to stay with (23, 19, -3, -3)?

ksahlin commented 9 months ago

Interesting. I guess we should go for (20, 16, -2, -2) then, if it is a bit faster and has slightly better PE accuracy. If you agree?

Just out of curiosity, did you check how much it increases/decreases rescue/runtime over baseline?

marcelm commented 9 months ago

In terms of "Total calls to ssw", (20, 16, -2, -2) has slightly fewer than the baseline for rye and CHM13 and slightly more for maize and drosophila (~5%).

ksahlin commented 9 months ago

I see, so difficult to predict any clear trends. Nevertheless, it seems favourable to baseline.