marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Overlap analysis #701

Closed by rhpvorderman 1 year ago

rhpvorderman commented 1 year ago

Hi Marcel,

I was taking a look at fastp the other day, and they have a rather interesting adapter trimming method. Cutadapt currently aligns the adapter sequence:

SEQUENCEOFINTERESTADAPTERGARBAGE
------------------ADAPTER-------

As a result, it cannot detect a very short adapter remnant such as:

SEQUENCEOFINTERESTAD
------------------AD

Fastp uses the paired-end information for overlap analysis: take the reverse complement of the second read and see where it matches the first read.

--------------SEQUENCEOFINTERESTADAPTERGARBAGE
GARBAGEADAPTERSEQUENCEOFINTEREST--------------

As a result it can also match this:

--SEQUENCEOFINTERESTAD
ADSEQUENCEOFINTEREST--
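
The two diagrams above can be sketched in a few lines of Python. This is only an illustration of the idea, not fastp's or cutadapt's actual implementation: the function names, the `min_overlap` and `max_mismatches` parameters, and the plain mismatch count (no indels) are all assumptions made for the example.

```python
# Hypothetical sketch of overlap-based adapter trimming for one read pair.
# Names and defaults are illustrative, not taken from fastp or cutadapt.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_insert_length(read1, read2, min_overlap=10, max_mismatches=0):
    """Return the insert length if read 1 starts inside reverse-complemented
    read 2 (the adapter read-through case), otherwise None."""
    rc2 = reverse_complement(read2)
    longest = min(len(read1), len(rc2))
    # Try the longest candidate insert first: a prefix of read 1 must
    # match the suffix of the reverse-complemented read 2.
    for length in range(longest, min_overlap - 1, -1):
        if mismatches(read1[:length], rc2[len(rc2) - length:]) <= max_mismatches:
            return length
    return None


def trim_pair(read1, read2, **kwargs):
    """Cut both reads down to the insert if an overlap is found."""
    length = find_insert_length(read1, read2, **kwargs)
    if length is None:
        return read1, read2
    return read1[:length], read2[:length]


if __name__ == "__main__":
    insert = "ACGTTGCAAGGCTTAACCGGTCA"
    # Each read carries a two-base adapter remnant after the insert.
    r1 = insert + "AG"
    r2 = reverse_complement(insert) + "AG"
    print(trim_pair(r1, r2))
```

Because the match is found between the two reads rather than against the adapter, everything after the insert is cut, no matter how short the remnant is.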

The advantage is that even very short adapter remnants can be removed, and that the method is less sensitive to errors in the adapter sequence when the adapter remnant is shorter than 10 bases (with the default maximum error rate of 0.1 that cutadapt uses, no errors are allowed in a match that short). Another advantage is that it requires only one pairwise alignment per read pair rather than two. The disadvantage is that the sequences to align are not known beforehand, which limits opportunities for optimization.
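
The point about short remnants and the 0.1 error rate can be made concrete with a small calculation. This is a minimal sketch, assuming the number of allowed errors is the maximum error rate times the length of the matched adapter part, rounded down; the function name `allowed_errors` is hypothetical and not part of cutadapt's API.

```python
import math


def allowed_errors(match_length: int, max_error_rate: float = 0.1) -> int:
    """Number of errors tolerated for a match of the given length,
    assuming error rate times length, rounded down."""
    return math.floor(match_length * max_error_rate)


if __name__ == "__main__":
    for length in (2, 5, 9, 10, 20):
        print(length, allowed_errors(length))
```

Under this assumption, any adapter remnant shorter than 10 bases has to match exactly, so a single sequencing error in the remnant prevents it from being found.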

marcelm commented 1 year ago

Yes, this would be nice to have. There’s actually an open issue about this at #332.

rhpvorderman commented 1 year ago

Ah sorry, I searched for "overlap" and nothing came up. I will close the duplicate.