Current adapter searcher does not work well for nanopore adapters

rhpvorderman commented 9 months ago

The adapter finder currently looks for 12 bp probes with no mismatches. This allows very fast vectorized searching for a finite number of patterns with no mismatches. Current matching rate is measured in GB/s per second rates. However this does miss a lot of adapters on nanopore sequences that have worse than 1 in 12 error rates.

A paper by Fulwider and Mukherjee can be used as a base to allow searching for probes with 1 or 2 mismatches. To prevent false positives, the probes need to be slightly longer.

rhpvorderman commented 9 months ago

The Wu and Manber paper with the original algorithm for approximate matching. For reference.

rhpvorderman commented 9 months ago

First some research is needed in adequate probe length. Several aspects need to be balanced here:

False positive rates. This is affected by the chance of a probe appearing in a random DNA sequence. If the probe is too short or the allowed errors too high, lots of adapters will be detected while they are not there.
Detection rate. This is affected by the sequencing error rate. If the allowed errors in the probe is too low or the probe is too long, many adapters will not be detected.
Available length. Probes can never be longer than the actual adapter, a machine word can fit at most 64 bp probes.
implementation speed. Allowing more errors incurs more cost of the bitwise method above is used. No errors at all is very quick.

Implementation wise the interesting lengths are 12, 16, 21, 32 and 64. That allows fitting 5, 4, 3, 2 and 1 probe into a machine word respectively. Using the binomial distribution, the amount of false positives and probe detection rates can be calculated.

probe           Probe detection rate for amount of sequencing errors    False positive rates for read length
length  errors    0.10%   1.00%   5.00%  10.00%  20.00%  40.00%  50.00%     151   8,000  20,000
12      0        98.81%  88.64%  54.04%  28.24%   6.87%   0.22%   0.02%  0.001%  0.048%  0.119%
12      1        99.99%  99.38%  88.16%  65.90%  27.49%   1.96%   0.32%  0.044%  2.309%  5.674%
16      0        98.41%  85.15%  44.01%  18.53%   2.81%   0.03%   0.00%  0.000%  0.000%  0.000%
16      1        99.99%  98.91%  81.08%  51.47%  14.07%   0.33%   0.03%  0.000%  0.012%  0.030%
21      1        99.98%  98.15%  71.70%  36.47%   5.76%   0.03%   0.00%  0.000%  0.000%  0.000%
21      2       100.00%  99.88%  91.51%  64.84%  17.87%   0.24%   0.01%  0.000%  0.001%  0.002%
21      3       100.00%  99.99%  98.11%  84.80%  37.04%   1.10%   0.07%  0.000%  0.016%  0.040%
21      4       100.00% 100.00%  99.68%  94.78%  58.60%   3.70%   0.36%  0.006%  0.294%  0.734%
32      4       100.00% 100.00%  97.96%  78.85%  20.44%   0.07%   0.00%  0.000%  0.000%  0.000%
32      6       100.00% 100.00%  99.91%  96.42%  53.55%   0.91%   0.03%  0.000%  0.000%  0.000%
32      8       100.00% 100.00% 100.00%  99.67%  82.54%   5.75%   0.35%  0.001%  0.032%  0.081%
32      10      100.00% 100.00% 100.00%  99.98%  95.89%  20.46%   2.51%  0.062%  3.232%  7.884%
64      10      100.00% 100.00%  99.97%  94.84%  24.10%   0.00%   0.00%  0.000%  0.000%  0.000%
64      16      100.00% 100.00% 100.00%  99.99%  87.46%   0.87%   0.00%  0.000%  0.000%  0.000%
64      20      100.00% 100.00% 100.00% 100.00%  98.90%   9.53%   0.18%  0.000%  0.000%  0.000%
64      22      100.00% 100.00% 100.00% 100.00%  99.78%  21.55%   0.84%  0.000%  0.004%  0.010%
64      23      100.00% 100.00% 100.00% 100.00%  99.91%  29.83%   1.64%  0.001%  0.028%  0.070%
63      24      100.00% 100.00% 100.00% 100.00%  99.97%  43.19%   3.85%  0.009%  0.486%  1.210%

This table shows that the current default of 12 bp with 0 errors performs pretty decently for Illumina short reads. False positive rate is extremely low and the detection rate for typical illumina sequencing errors (<1.00%) is higher than 88%. So adapter counts will be slightly underestimated, but it nevertheless gives a good picture.

For illumina a length of 16 and 1 allowed error would be even better, as this gives a 99%+ detection rate with almost no false positives.

For nanopore reads, reads are on average roughly 8000 bp in length while also often reaching 20000 bp. For the current setting of 12 bp 0 error probes, this means that many probes will be detected at a false positive rate of 0.119%. (Detection threshold is set at 0.1%). Also nanopore reads have an error rate around 10%, with the worst quality being 20% errors. Some parts (the beginning and end of the read) have even worse error rates.

The table shows that for very noisy sequences with 40% or 50% errors detection is almost impossible for short probes. This is the reason sequali currently find very few forward adapters. The forward adapters are at the beginning of the sequence. This is apparently noisier than the 40 or 50% shown here. Since the pore current is determined by 5 bases, the first 5 base calls are likely to be extremely noisy as they cannot rely on the information of the previous 5 base calls. From this it follows that forward probes for nanopore are better taken from the 3' end of the adapter rather than the 5' end.

The table shows that probes of 21 bp with 3 allowed errors allow acceptable detection rates for sequences with a 10% error rate, while also keeping false positive rates low. This would allow near-perfect scores for illumina sequences.

A 32 bp probe with 8 allowed errors is even better, however here we run into the problem of adapters usually not having a big enough length to actually use so many base pairs. Especially considering that the first 5 bp of a nanopore sequence cannot be properly detected.

Rather than making a generalized algorithm where the number of errors can be set, hard coding gives performance benefits as loops can be completely unrolled. If the solution turns out to be slower subsampling can be applied in order to keep the speed high.

rhpvorderman commented 9 months ago

Wikipedia's article on bitap contains a full C example allowing only mismatches for the Hamming distance. The Hamming distance is probably good enough for the very short probe length and will make it possible to retain some speed.

rhpvorderman commented 9 months ago

This was instigated by a file sent to me by @marcelm. Here are some practical tests to validate the foundings above. An apllication named tre-agrep can do approximate matching of words in files and can be used to test these sort of things.

Total number of records in the file 88,518

length	allowed errors	sequence	finds	finds (%)	comment
12	0	CCTGTACTTCGT	59	0.00	Ligation kit, top strand, first non-T part.
16	1	CCTGTACTTCGTTCAG	31739	35.86	Ligation kit, top strand, first non-T part.
21	3	CCTGTACTTCGTTCAGTTACG	64950	73.37	Ligation kit, top strand, first non-T part.
12	0	TTACGTATTGCT	68780	77.70	Ligation kit, top strand, last bases.
16	1	TCAGTTACGTATTGCT	69373	78.37	Ligation kit, top strand, last bases.
21	3	TTCGTTCAGTTACGTATTGCT	82595	93.31	Ligation kit, top strand, last bases.

It looks like simply placing the probe at the end of the adapter is the most cost effective strategy. The above settings where with the default settings of tre-agrep which calculates edit distance. Setting the insertion and deletion costs to 10 will calculate the table for hamming distance:

length	allowed errors	sequence	finds	finds (%)	comment
16	1	CCTGTACTTCGTTCAG	30050	33.95	Ligation kit, top strand, first non-T part.
21	3	CCTGTACTTCGTTCAGTTACG	44191	49.92	Ligation kit, top strand, first non-T part.
16	1	TCAGTTACGTATTGCT	67670	76.45	Ligation kit, top strand, last bases.
21	3	TTCGTTCAGTTACGTATTGCT	63254	71.46	Ligation kit, top strand, last bases.

If only hamming distance is taken into account, the 12 bp probe with 0 allowed errors scores better than the error-allowing variant. It seems that the longer the probe is, the more likely a insertion or deletion is to occur. It is known from literature that indels are common in nanopore sequences.

Conclusion, no adaptation for the algorithm is needed, but adapter probes for nanopore need some editing. The last part af the front adapter and the first part of the backadapter is the best candidate given nanopore quality at the extremes of the sequence.

rhpvorderman / sequali

Current adapter searcher does not work well for nanopore adapters #69