A new profile for read length 75

marcelm commented 9 months ago

As discussed in #272, it may be appropriate to add 75 as a new canonical read length.

I did an extensive search for a good set of parameters that improves accuracy over the existing one.

Reads of lengths 75 currently use the same parameters as reads of lengths 50, which are: $k=20$, $s=16$, $l=-3$, $u=2$. The accuracy we get using these parameters is our baseline.

I tested the following values for the parameters:

parameter	values
k	17, ..., 23
s	$k-4$, $k-2$ (but see below)
l	-4, ..., 2
u	$l$, ..., 2

The script I wrote only tests single-end data. The assumption is that good single-end performance also means good paired-end performance. The script optimizes accuracy averaged over all four genomes, with one heuristic: It only accepts parameter combinations where the accuracy is better than the baseline for all four genomes. This speeds it up: When it tests a combination, it tests the genomes one by one and can stop as soon as the obtained accuracy is lower than the baseline. By starting with the smallest genome (Drosophila), often the bigger genomes don’t need to be tested at all.

The parameter combinations with $s=k-2$ performed really well, but this comes at a cost of increased memory usage, so I report results for both separately.

Here are the top 10 parameter combinations and the baseline for $s=k-4$:

k	s	l	u	avg. accuracy	diff to baseline
23	19	-3	-3	75.0243	+0.262525
22	18	-3	-3	75.003925	+0.24215
18	14	-1	-1	74.9946	+0.232825
19	15	-1	-1	74.988925	+0.22715
21	17	-3	-3	74.988875	+0.22710
17	13	-1	-1	74.976225	+0.21445
20	16	-2	-2	74.947225	+0.18545
20	16	-3	-3	74.936225	+0.17445
21	17	-3	-2	74.891725	+0.12995
20	16	-3	-2	74.88965	+0.127875
20	16	-3	2	74.761775	0 (baseline)

Here are the top 10 parameter combinations if we allow $s=k-2$:

k	s	l	u	avg. accuracy	diff to baseline
18	16	-3	-3	75.236575	+0.4748
19	17	-3	-3	75.2342	+0.472425
20	18	-3	-3	75.233575	+0.4718
17	15	-2	-1	75.225825	+0.46405
20	18	-4	-3	75.2253	+0.463525
19	17	-3	-2	75.22265	+0.460875
17	15	-1	-1	75.220625	+0.45885
18	16	-2	-2	75.21785	+0.456075
18	16	-3	-2	75.21635	+0.454575
19	17	-4	-3	75.213925	+0.45215

My feeling is that this additional increase is not worth the cost in memory, at least not for the default settings (sampling rate increases from 1/5 to 1/3, i.e., 67% more randstrobes).

I am running the above combinations on paired-end data at the moment and will report the results later.

marcelm commented 9 months ago

It turns out good single-end accuracy does not necessarily imply good paired-end accuracy, so here is an optimization round only on paired-end data:

k	s	l	u	accuracy
22	18	-2	-2	86.833475
23	19	-2	-2	86.8255125
21	17	-2	-2	86.818425
20	16	-2	-2	86.7994125 (suggested)
19	15	-1	-1	86.7724
23	19	-3	-3	86.76275
18	14	-1	-1	86.7576875
21	17	-1	-1	86.734175
22	18	-3	-3	86.732175
22	18	-1	-1	86.7238125
20	16	-3	2	86.511525 (baseline)

Now the question is which parameter combination to pick because it is going to be a tradeoff between SE and PE accuracy. (We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).)

Because PE accuracy is maybe more important than SE, the following table ranks the parameter combinations by PE accuracy plus 25% of SE accuracy:

k	s	l	u	PE	SE
20	16	-2	-2	86.7994125	74.947225
19	15	-1	-1	86.7724	74.988925
23	19	-3	-3	86.76275	75.0243
18	14	-1	-1	86.7576875	74.9946
22	18	-3	-3	86.732175	75.003925
21	17	-3	-3	86.6811875	74.988875
18	14	-1	0	86.6850375	74.8734
20	16	-3	-3	86.643825	74.936225
22	18	-3	-2	86.6519875	74.882975
21	17	-3	-2	86.6393125	74.891725

Picking $k=20$, $s=16$, $l=u=-2$ would improve SE accuracy by 0.3 and PE accuracy by 0.18. (And it’s similar to what we have now.)

marcelm commented 9 months ago

I’m bit suprised many of these good parameter combinations have $l=u$. That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?

ksahlin commented 9 months ago

We could select parameters depending on whether SE or PE data is mapped, but I’d prefer not to go down that route at the moment. Also, we actually support mixed input (SE+PE).

I agree

It could be that (20,16,-2,-2) invokes more rescue and is therefore slower (not suggesting you to do further benchmarks). Given that (20,16,-2,-2) and (23, 19, -3, -3) gives near identical increase in accuracy for PE, I would be okay picking (23, 19, -3, -3) (but no strong preference).

I’m bit suprised many of these good parameter combinations have l=u That isn’t really using "rand"strobes anymore. Not sure what this implies. Is this just something with the test data?

I don't think any of the parameter settings above, except (18, 14, -1, 0), yield 'randstrobes' (in the way they were first intended to be designed, i.e., with a gap in the middle), because the offset of 1, 2,or at most 3 syncmers downstream for the first syncmer rarely make it beyond k nt downstream. Our parameter settings above instead generates flexible sized k-mers.

Also, strobemers were originally intended for longer reads and are not expected to work well for very short reads. In the strobealign paper I argue that 2*150nt reads is where we really start to see benefit over e.g., k-mer based tools. What we are seeing in above analysis (that short exact k-mers are working better for <100nt reads) is expected.

ksahlin commented 9 months ago

Not sure what this implies. Is this just something with the test data?

Another way to answer this: I think for 50nt and 75nt reads, it helps having short seeds, simply to pack in as many valid seeds as possible in the hope that at least one matches in case of an error. There aren't really any 'room' to create longer randomly spaced seeds.

ksahlin commented 9 months ago

Coming back to this, I would now even have slight preference for the setting (23, 19, -3, -3). This is because (23, 19, -3, -3) generates flexible k-mers with min size 26 and mean 28, while (20, 16, -2, -2) generates k-mers with min 26 and mean 30. The larger end of the distribution is also more likely to be lower for the first setting due to the whole distribution is pushed towards smaller values.

Not having to do as many rescue searches is probably favourable in speed and possibly biological data is more noisy. I should do a full benchmark when we have a candidate commit for release 0.12.0.

marcelm commented 9 months ago

Ok, let’s use (23, 19, -3, -3). I’ll measure the no. of rescues and runtime and will add that information when I open the PR.

And because the script for optimizing parameters is now written anyway, I also ran it on read length 100. Here are the best settings in comparison to the current ones:

k	s	l	u	avg. accuracy
20	16	-1	-1	84.79074375
19	15	0	0	84.78106875
23	19	-2	-2	84.77730625
18	14	0	0	84.769125
22	18	-2	-2	84.763125
21	17	-2	-1	84.7377
20	16	-2	2	84.68755625 (current)

Shall I change read length 100 to (23, 19, -2, -2)? SE accuracy would go from 79.943775 to 80.0287, and PE accuracy from 89.4313375 to 89.5259125. Would make sense to just increment l and u by one compared to the new parameters for read length 75.

And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.

ksahlin commented 9 months ago

Great, I agree that we can use (23, 19, -2, -2) for the 100nt canoncal class.

And I’ll run the remaining lengths as well. This is almost no effort now, just need to wait a bit.

Awesome! Keep in mind that for the longer read lengths (I predict that) there will be more time-accuracy tradeoffs. Even though we may still see good accuracy with 'flexible size' k-mers (through randstrobes method), we would have to assess more in detail the runtime to longer 'real' randstrobes (with gap).

The reason I am not to worried assessing runtime tradeoffs for the very short reads (50 and 75nt) is because longer seeds leads to so much unnecessary rescuing that the short seeds may even be beneficial for runtime (i.e. a win-win).

marcelm commented 9 months ago

It turns out the number of rescues actually goes up with (23, 19, -3, -3) in three out of four datasets. From the logs for maize paired ends:

(20, 16, -2, -2):

Tried NAM rescue: 845713
Mates rescued by alignment: 2741107
Total calls to ssw: 4414486
User time (seconds): 1045.74

(23, 19, -3, -3):

Tried NAM rescue: 901729
Mates rescued by alignment: 2590267
Total calls to ssw: 4459521
User time (seconds): 1092.19

Only in Drosophila, the "NAM rescues" are lower with the new setting, but even there, the number of SSW alignments goes up. Runtime cannot be trusted that much, but the trend seems to be that it is slightly higher for (23, 19, -3, -3).

What do you think, do you want to stay with (23, 19, -3, -3)?

ksahlin commented 9 months ago

Interesting. I guess we should go for (20, 16, -2, -2) then, if it is a bit faster and has slightly better PE accuracy. If you agree?

Just out of curiosity, did you check how much it increases/decreases rescue/runtime over baseline?

marcelm commented 9 months ago

In terms of "Total calls to ssw", (20, 16, -2, -2) has slightly fewer than the baseline for rye and CHM13 and slightly more for maize and drosophila (~5%).

ksahlin commented 9 months ago

I see, so difficult to predict any clear trends. Nevertheless, it seems favourable to baseline.

ksahlin / strobealign

A new profile for read length 75 #339