marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
501 stars 126 forks source link

Trimming based on PE read alignment #332

Open rspreafico-sgi opened 5 years ago

rspreafico-sgi commented 5 years ago

Hi, thanks for developing this awesome tool!

For PE trimming, it would be extremely helpful to use an adapter trimming mode based on read1/read2 alignment, as opposed to separate alignment of the adapter to each read. This would result in a more sensitive and specific trimming. Basically, this is one of the two trimming modes implemented by Trimmomatic. For cutadapt, I wonder whether one could fork this feature back from atropos, which is itself a fork of cutadapt, but no longer maintained in the past 6 months, and implements this mode of trimming.

PE reads are way more common nowadays than they used to be only 1-2 years ago, and this mode of trimming would be very welcome. It is indeed the only reason why I sometimes still need to use Trimmomatic rather than cutadapt.

Thanks for considering this enhancement!

marcelm commented 5 years ago

Hm, you may have a point. I haven’t followed Atropos development, but perhaps it’s time to see whether some of the code can be re-integrated into Cutadapt. Note to myself: Docs are at https://atropos.readthedocs.io/en/latest/guide.html#alignment-algorithms

marcelm commented 2 years ago

A couple of thoughts because this came up just now in #552:

I’ve had a section about how to do the alignment in my PhD thesis. The idea is to not just align R1 with the reverse complement of R2, but to also take the adapters into account.

Given (adapters are 3' adapters):

R1 is some prefix of: fragment | R1_adapter R2 is some prefix of: rc_fragment | R2_adapter

Then

rc of R2 is some suffix of rc_R2_adapter | fragment

We can then semi-globally align the following strings:

rc_R2_adapter | R1 and rc_R2 | R1_adapter

Then the traceback through the alignment matrix gives the trimming position(s).

I still think this is a neat way of doing the alignment, but there are a couple of problems:

At least initially, a solution that is possibly good enough may be to align adapters to both reads as currently, but then if a match was found on one read but not the other, one could do some "rescue" thing with increased maximum allowed errors.

On the other hand, if we do the full alignment, we can easily extend this to merge paired-end reads.

rhpvorderman commented 1 year ago

On good libraries, the insert size is likely such that adapters are few. Since the distribution of insert sizes follows a bell curve (? not entirely certain about this, but it seems to be the case), this means that adapter overlaps of 1, 2 and 3 are more common than adapter overlaps of 30 and more.

The most likely case is that there are no adapters. The second most likely case is that not the entire adapter sequence is present. In both these cases using the paired-end alignment will yield much better results as it has more information to go on. Only when the insert size is smaller than the length of the adapter the adapter alignment will fair better in case of the paired-end reads.

Full semiglobal alignment is possible with the alignment algorithm in Cutadapt, but slow at the moment. Realistically, we need to start using an external library (but perhaps not in the first version of this)

It can be customized a bit to make it less expensive. Say take a seed length of 32 and allow 3 errors. That will limit the size of the matrix (or array) to compute. Then after that a less expensive seed extension algorithm can be run. Because previous positions do not need to be taken into account and a maximum of 3 indels is allowed, only 3 cells of the matrix in either direction need to be calculated. Less if there is a separate indel score, which is highly recommended for illumina data, where these errors are rare. Indels occur typically at a 1:100 ratio compared to nucleotide switches.

Also for an adapter to occur the beginning of read1 must be present in the reverse complement of read 2. If the insert size is quite large this does not occur. Which means the kmer trick can be used to quickly check whether the beginning of read1 is present. With a seed length of 32 we can use 4 8-mers and this will give a very low false positive rate. The only chance of a false negative is if the insert size is less than 8. In that case we can run the adapter matching algorithm as a backup.

So there are some options to do this more speedy, I think. Though there will be quite some code complexity involved to achieve optimal performance.