marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 126 forks source link

Catching ONT adapters: allow setting insertion and deletion scores #745

Closed rhpvorderman closed 6 months ago

rhpvorderman commented 7 months ago

Hi Marcel,

Currently I am working on re-evaluating the Nanopore QC best practices. I did some research on the adapters and as stated in #742 indels should not be penalized to compare to other mismatches. Unfortunately I found no command line flag on cutadapt that could do this, even though the algorithm is suited for this use case.

In practice this means that setting a 0.2 error rate on a 28 bp adapter leads to 5 allowed errors, but only 2 allowed indels. In Nanopore sequencing, the indels at that part of the sequence are just as common as the substitutions, so that will lead to a lot of false negatives. The adapter sequence is usually present in all the reads.

I wanted to patch this in myself, but I am running into some issues with understanding the code:

https://github.com/marcelm/cutadapt/blob/3407ac0004d04b11ae7157934a6665ecaf82c328/src/cutadapt/_align.pyx#L215-L223

It looks like there is a difference between insertion_score/insertion_cost with the same applying for deletions. These constants are evaluated in different pieces of the code.

I am happy to see that all the costs and scores are at least saved as struct attributes in self, which means that the code does not need any major overhauls to support this feature.

marcelm commented 6 months ago

You may not need to change the parameters. In short, the algorithm uses unit costs (for the error rate). The scores are only used for some "refinement". More details are here: https://cutadapt.readthedocs.io/en/stable/algorithms.html#algorithm-indel-scores

In practice this means that setting a 0.2 error rate on a 28 bp adapter leads to 5 allowed errors, but only 2 allowed indels.

This should not be the case. Error rate 0.2 on a 28 bp adapter means at most two errors are allowed. Mismatches, insertions and deletions each count as one error (unit costs).

rhpvorderman commented 6 months ago

Great! I was just thinking about it. It does make sense to allow a certain number of errors (cost) but to prefer mismatches over indels in the score. I understand the algorithm better now.

That means I can recommend cutadapt to everyone for their Nanopore data! I will spread the word. Thanks!

EDIT: Just a tiny nitpick:

Error rate 0.2 on a 28 bp adapter means at most two errors are allowed.

0.2 x 28 = 5.6, so that should be 5 errors right?

marcelm commented 6 months ago

0.2 x 28 = 5.6, so that should be 5 errors right?

Sorry, of course it should be 5!

rhpvorderman commented 6 months ago

Ah thanks. I was worried that I had misunderstood the algorithm :sweat_smile: .