rhpvorderman / sequali

Fast sequencing data quality metrics
GNU Affero General Public License v3.0
11 stars 0 forks source link

Current adapter searcher does not work well for nanopore adapters #69

Closed rhpvorderman closed 9 months ago

rhpvorderman commented 9 months ago

The adapter finder currently looks for 12 bp probes with no mismatches. This allows very fast vectorized searching for a finite number of patterns with no mismatches. Current matching rate is measured in GB/s per second rates. However this does miss a lot of adapters on nanopore sequences that have worse than 1 in 12 error rates.

A paper by Fulwider and Mukherjee can be used as a base to allow searching for probes with 1 or 2 mismatches. To prevent false positives, the probes need to be slightly longer.

rhpvorderman commented 9 months ago

The Wu and Manber paper with the original algorithm for approximate matching. For reference.

rhpvorderman commented 9 months ago

First some research is needed in adequate probe length. Several aspects need to be balanced here:

Implementation wise the interesting lengths are 12, 16, 21, 32 and 64. That allows fitting 5, 4, 3, 2 and 1 probe into a machine word respectively. Using the binomial distribution, the amount of false positives and probe detection rates can be calculated.

probe           Probe detection rate for amount of sequencing errors    False positive rates for read length
length  errors    0.10%   1.00%   5.00%  10.00%  20.00%  40.00%  50.00%     151   8,000  20,000
12      0        98.81%  88.64%  54.04%  28.24%   6.87%   0.22%   0.02%  0.001%  0.048%  0.119%
12      1        99.99%  99.38%  88.16%  65.90%  27.49%   1.96%   0.32%  0.044%  2.309%  5.674%
16      0        98.41%  85.15%  44.01%  18.53%   2.81%   0.03%   0.00%  0.000%  0.000%  0.000%
16      1        99.99%  98.91%  81.08%  51.47%  14.07%   0.33%   0.03%  0.000%  0.012%  0.030%
21      1        99.98%  98.15%  71.70%  36.47%   5.76%   0.03%   0.00%  0.000%  0.000%  0.000%
21      2       100.00%  99.88%  91.51%  64.84%  17.87%   0.24%   0.01%  0.000%  0.001%  0.002%
21      3       100.00%  99.99%  98.11%  84.80%  37.04%   1.10%   0.07%  0.000%  0.016%  0.040%
21      4       100.00% 100.00%  99.68%  94.78%  58.60%   3.70%   0.36%  0.006%  0.294%  0.734%
32      4       100.00% 100.00%  97.96%  78.85%  20.44%   0.07%   0.00%  0.000%  0.000%  0.000%
32      6       100.00% 100.00%  99.91%  96.42%  53.55%   0.91%   0.03%  0.000%  0.000%  0.000%
32      8       100.00% 100.00% 100.00%  99.67%  82.54%   5.75%   0.35%  0.001%  0.032%  0.081%
32      10      100.00% 100.00% 100.00%  99.98%  95.89%  20.46%   2.51%  0.062%  3.232%  7.884%
64      10      100.00% 100.00%  99.97%  94.84%  24.10%   0.00%   0.00%  0.000%  0.000%  0.000%
64      16      100.00% 100.00% 100.00%  99.99%  87.46%   0.87%   0.00%  0.000%  0.000%  0.000%
64      20      100.00% 100.00% 100.00% 100.00%  98.90%   9.53%   0.18%  0.000%  0.000%  0.000%
64      22      100.00% 100.00% 100.00% 100.00%  99.78%  21.55%   0.84%  0.000%  0.004%  0.010%
64      23      100.00% 100.00% 100.00% 100.00%  99.91%  29.83%   1.64%  0.001%  0.028%  0.070%
63      24      100.00% 100.00% 100.00% 100.00%  99.97%  43.19%   3.85%  0.009%  0.486%  1.210%

This table shows that the current default of 12 bp with 0 errors performs pretty decently for Illumina short reads. False positive rate is extremely low and the detection rate for typical illumina sequencing errors (<1.00%) is higher than 88%. So adapter counts will be slightly underestimated, but it nevertheless gives a good picture.

For illumina a length of 16 and 1 allowed error would be even better, as this gives a 99%+ detection rate with almost no false positives.

For nanopore reads, reads are on average roughly 8000 bp in length while also often reaching 20000 bp. For the current setting of 12 bp 0 error probes, this means that many probes will be detected at a false positive rate of 0.119%. (Detection threshold is set at 0.1%). Also nanopore reads have an error rate around 10%, with the worst quality being 20% errors. Some parts (the beginning and end of the read) have even worse error rates.

The table shows that for very noisy sequences with 40% or 50% errors detection is almost impossible for short probes. This is the reason sequali currently find very few forward adapters. The forward adapters are at the beginning of the sequence. This is apparently noisier than the 40 or 50% shown here. Since the pore current is determined by 5 bases, the first 5 base calls are likely to be extremely noisy as they cannot rely on the information of the previous 5 base calls. From this it follows that forward probes for nanopore are better taken from the 3' end of the adapter rather than the 5' end.

The table shows that probes of 21 bp with 3 allowed errors allow acceptable detection rates for sequences with a 10% error rate, while also keeping false positive rates low. This would allow near-perfect scores for illumina sequences.

A 32 bp probe with 8 allowed errors is even better, however here we run into the problem of adapters usually not having a big enough length to actually use so many base pairs. Especially considering that the first 5 bp of a nanopore sequence cannot be properly detected.

Rather than making a generalized algorithm where the number of errors can be set, hard coding gives performance benefits as loops can be completely unrolled. If the solution turns out to be slower subsampling can be applied in order to keep the speed high.

rhpvorderman commented 9 months ago

Wikipedia's article on bitap contains a full C example allowing only mismatches for the Hamming distance. The Hamming distance is probably good enough for the very short probe length and will make it possible to retain some speed.

rhpvorderman commented 9 months ago

This was instigated by a file sent to me by @marcelm. Here are some practical tests to validate the foundings above. An apllication named tre-agrep can do approximate matching of words in files and can be used to test these sort of things.

Total number of records in the file 88,518

length allowed errors sequence finds finds (%) comment
12 0 CCTGTACTTCGT 59 0.00 Ligation kit, top strand, first non-T part.
16 1 CCTGTACTTCGTTCAG 31739 35.86 Ligation kit, top strand, first non-T part.
21 3 CCTGTACTTCGTTCAGTTACG 64950 73.37 Ligation kit, top strand, first non-T part.
12 0 TTACGTATTGCT 68780 77.70 Ligation kit, top strand, last bases.
16 1 TCAGTTACGTATTGCT 69373 78.37 Ligation kit, top strand, last bases.
21 3 TTCGTTCAGTTACGTATTGCT 82595 93.31 Ligation kit, top strand, last bases.

It looks like simply placing the probe at the end of the adapter is the most cost effective strategy. The above settings where with the default settings of tre-agrep which calculates edit distance. Setting the insertion and deletion costs to 10 will calculate the table for hamming distance:

length allowed errors sequence finds finds (%) comment
16 1 CCTGTACTTCGTTCAG 30050 33.95 Ligation kit, top strand, first non-T part.
21 3 CCTGTACTTCGTTCAGTTACG 44191 49.92 Ligation kit, top strand, first non-T part.
16 1 TCAGTTACGTATTGCT 67670 76.45 Ligation kit, top strand, last bases.
21 3 TTCGTTCAGTTACGTATTGCT 63254 71.46 Ligation kit, top strand, last bases.

If only hamming distance is taken into account, the 12 bp probe with 0 allowed errors scores better than the error-allowing variant. It seems that the longer the probe is, the more likely a insertion or deletion is to occur. It is known from literature that indels are common in nanopore sequences.

Conclusion, no adaptation for the algorithm is needed, but adapter probes for nanopore need some editing. The last part af the front adapter and the first part of the backadapter is the best candidate given nanopore quality at the extremes of the sequence.