marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Very different trimming result between version 3.7 and 4.4 #715

Closed: fjossandon closed this 11 months ago

fjossandon commented 12 months ago

Hello again! I wanted to migrate from my old Cutadapt 3.7 to the shiny new 4.4, but after some testing I saw that I was getting very different results, with the Cutadapt version being the only difference. My input parameters are the following:

cutadapt \
    -a "ATTAGAWACCCBDGTAGTCC;max_error_rate=0.1;min_overlap=20" \
    -A "TTACCGCGGCKGCTGRCAC;max_error_rate=0.5;min_overlap=16" \
    --pair-adapters \
    --pair-filter any \
    --cores 2 \
    --output 230553_R1_TRIMMED.fastq \
    --paired-output 230553_R2_TRIMMED.fastq \
    230553_S1_L001_R1_001.fastq.gz \
    230553_S1_L001_R2_001.fastq.gz

A long time ago, I tweaked those error rate and overlap parameters to give the best results for my dataset. I can paste the whole output log here if you need it, but this is the main difference:

CUTADAPT 3.7
Total basepairs processed:    87,244,248 bp
  Read 1:    43,622,124 bp
  Read 2:    43,622,124 bp
Total written (filtered):     61,204,595 bp (70.2%)
  Read 1:    30,617,402 bp
  Read 2:    30,587,193 bp

CUTADAPT 4.4
Total basepairs processed:    87,244,248 bp
  Read 1:    43,622,124 bp
  Read 2:    43,622,124 bp
Total written (filtered):     31,604,004 bp (36.2%)
  Read 1:    30,617,402 bp
  Read 2:       986,602 bp

Many, many bases are lost in the reverse reads; far too much is being trimmed away.

For example, take this reverse read. Below I show the raw read (with the searched primer printed above the sequence line to indicate where it aligns, at nucleotide 211), followed by the 3.7-trimmed read and the 4.4-trimmed read.

ORIGINAL RAW READ
@M08602:31:000000000-L4HJ5:1:1101:17171:1124 2:N:0:CGAGAGTT+ATCGTACG                                                                                                                                              TTACCGCGGCKGCTGRCAC
CATTTAATGATCGTCTCCTAATTTTAGTTCAGTTGCTAGATATCACCGCCAAAAGGGTAATGAAACTGTGAAACCATCGGAGCCAAACAGTTGAGTCAGAACCAGCCTTCATGGGGTTGTCGGAATCTGGATTTAGAACTGTACTCCCTAGCCTCGCGGGCACAGCCTCGTATCCATCTATCAGGAGAACCAGCGTTCCACAGTACCTCTTTACCGCGGCTGCTGACACTCCCAACTATTCCGTACGATGTGTAGTTCTCTGTTGGCTCCGTCTCCTTTATAAAACACACGTGCACATGAC
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGADFCGGGGGFGFFFGFGFFGGFFFFECAEFFFEEFGFD9DCBFFFFFDE;35FFFFFF4566?(/25826;46-*))/*..)6))53:())(1(0)0+.)))))..)3.(((.,(())..))

#####

TRIMMED 3.7 READ
@M08602:31:000000000-L4HJ5:1:1101:17171:1124 2:N:0:CGAGAGTT+ATCGTACG
CATTTAATGATCGTCTCCTAATTTTAGTTCAGTTGCTAGATATCACCGCCAAAAGGGTAATGAAACTGTGAAACCATCGGAGCCAAACAGTTGAGTCAGAACCAGCCTTCATGGGGTTGTCGGAATCTGGATTTAGAACTGTACTCCCTAGCCTCGCGGGCACAGCCTCGTATCCATCTATCAGGAGAACCAGCGTTCCACAGTACCTCT
+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDFGGGGGGADFCGGGGGFGFFFGFGFFGGFFFFECAEFF

#####

TRIMMED 4.4 READ
@M08602:31:000000000-L4HJ5:1:1101:17171:1124 2:N:0:CGAGAGTT+ATCGTACG
CAT
+
CCC

As you can see, 3.7 trimmed the read in the correct place, while 4.4 erased almost the whole read, even though the quality scores were fine (in case that is a factor). It's the same for almost all reads in that FASTQ file. I also tried version 4.0 and got the same result as 4.4, so it's definitely the 4.0 change.

Looking directly at the reverse read sequence, I can see that 3.7 aligned it correctly, so I'm not sure why 4.4 behaves differently... It could be the alignment algorithm change, my arguments being parsed differently (maybe I need to use other arguments?), some change in the filtering, or something else I'm not seeing. I can send you the whole file if that helps. Please also note that I'm concerned here about the sequence trimming, not necessarily the read quality, since afterward I merge the forward and reverse reads and quality-filter them post-merge; the problem is that the lack of reverse sequences prevents that merge.

This is the one thing preventing me from migrating from 3.7 to 4.4. I hope you can shed some light on this.

Best regards,

marcelm commented 12 months ago

Hi and thanks for the detailed report!

This is very likely due to this change (quoting from the change log):

The alignment algorithm was tweaked [...] to more accurately pick the leftmost adapter occurrence if there are multiple.

You can read about this in the algorithm description starting from the sentence "A second change in the alignment algorithm ...".

In your case, the maximum error rate of 0.5 for the R2 adapter is very high. At its length of 19 nucleotides, this allows up to 9 errors. The sequence also contains two IUPAC wildcard characters, which further increases the probability of a match arising by chance. I believe this is the alignment that Cutadapt finds:

   TTACCGCGGCKGCTGRCAC
      xx xxx  x  xx x 
CATTTAATGAT-CGTCT--C-CTAATTTTAGTTCAGTTGCTAGATATCACCGCCAAAAGGGTAA

This has "only" nine errors, which is why the alignment algorithm reports it.

I did an experiment: I generated 100,000 completely random reads and ran Cutadapt on them with the above adapter, minimum overlap, and maximum error rate settings, and all but a single read (!) had an occurrence of that sequence.
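
If you want to reproduce this, a minimal sketch of that experiment could look like the following (the 300 nt read length, the seed, and the file names are arbitrary choices of mine; it assumes cutadapt is on your PATH):

import random
import subprocess

# Write 100,000 random 300 nt reads as FASTQ ('I' is Q40 in Phred+33).
random.seed(1)
with open("random.fastq", "w") as f:
    for i in range(100_000):
        seq = "".join(random.choice("ACGT") for _ in range(300))
        f.write(f"@random_{i}\n{seq}\n+\n{'I' * 300}\n")

# Trim with the R2 adapter settings from this issue.
subprocess.run(
    ["cutadapt",
     "-a", "TTACCGCGGCKGCTGRCAC;max_error_rate=0.5;min_overlap=16",
     "-o", "random_trimmed.fastq", "random.fastq"],
    check=True,
)

The "Reads with adapters" line in the report then shows how many of the random reads matched purely by chance.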

Note that Cutadapt currently does not take the maximum error rate into account when computing the "expect" column in the report. Otherwise, that column would have shown very high numbers and been a clue to what is going on.

From the above, I conclude that the maximum error rate setting of 0.5 is simply too high and leads to highly non-specific trimming. The algorithm in 3.7 masked this overly sensitive setting because, when there were multiple possible matches, it sometimes preferred matches towards the 3' end that had fewer errors.

fjossandon commented 12 months ago

Wow, I'm shocked by your alignment! I never thought it would align like that; I guess I can't spot the indels by eye.

Thanks for the explanation! It looks like I have to readjust those error rates for the new version; a max error rate of 0.3 in v4.4 is the one that gives results closest to version 3.7... The error rate in reverse reads is usually higher, and I think some of my older datasets are higher still, so this somehow worked in 3.7, but not anymore.

The algorithm in 3.7 masked this overly sensitive setting because for multiple possible matches, it sometimes preferred matches towards the 3' end that have lower errors.

Would this change affect --times somehow? There is another command I use to predict amplicons in larger FASTA sequences, and with the previous version, adjusting --times would optimize the number of predicted sequences within the expected size range. If Cutadapt v4.4 does not align where it used to in v3.7, does this mean that I should check whether --times needs to be reduced too?

marcelm commented 10 months ago

Would this change affect --times somehow? There is another command I use to predict amplicons in larger FASTA sequences, and with the previous version, adjusting --times would optimize the number of predicted sequences within the expected size range. If Cutadapt v4.4 does not align where it used to in v3.7, does this mean that I should check whether --times needs to be reduced too?

Hi, I’m back from vacation and I hope you solved this already, but I thought I’d reply for completeness.

I’m not sure whether --times would be affected in your application. The behavior would sometimes be different, but the result might well be the same.

Let’s say you have three occurrences a1, a2, a3 of an adapter/primer in your read like this:

AAA a1 CCC a2 GGG a3 TTT

If you use --times=3, Cutadapt does three rounds of trimming. With Cutadapt 3.7, the results of these rounds could (sometimes!) be like this:

AAA a1 CCC a2 GGG
AAA a1 CCC
AAA

If the desired result is just AAA, then using --times=3 would be necessary, because otherwise the output would sometimes be from one of the intermediate rounds.

With the current algorithm, it is more likely that the leftmost occurrence (a1) is found already in the first round, so you would see the same result even with --times=1.
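
A small way to see this for yourself is a sketch like the following (the adapter sequence, read layout, and file names are made up purely for illustration; this is not from your data):

import subprocess

# A read with three embedded occurrences of a made-up 20 nt adapter,
# mirroring the AAA a1 CCC a2 GGG a3 TTT example above.
adapter = "TGCAGGACCTTGATCGAAGT"
read = "AAA" + adapter + "CCC" + adapter + "GGG" + adapter + "TTT"
with open("times_demo.fastq", "w") as f:
    f.write(f"@demo\n{read}\n+\n{'I' * len(read)}\n")

# Trim once and three times; then compare the two outputs.
for n in (1, 3):
    subprocess.run(
        ["cutadapt", "--times", str(n), "-a", adapter,
         "-o", f"trimmed_times{n}.fastq", "times_demo.fastq"],
        check=True,
    )

With the current algorithm, both output files should already contain just AAA, because the leftmost occurrence is found in the first round.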

But I want to emphasize that it has always been the intention that the algorithm find the leftmost occurrence of a query sequence, and it does so in most cases even in 3.7. The above, where a3 or a2 is trimmed instead of a1, is just something that happened under some circumstances in 3.7.