marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 125 forks source link

why the 5' adapter wasn't trimmed when adding --revcomp. #739

Closed permia closed 8 months ago

permia commented 8 months ago

This is cutadapt 4.4 with Python 3.10.12

I can't understand why the 5' adapter wasn't trimmed when adding --revcomp.

 echo -e ">r\nAGATGTGTATAAGAGACAGaaaacgagaagtagtgtgactactattcggtgagagagtagcgcgagttagagataagaatatacgagaagtagtgtgactactattcggtgagagagtagcgcgagtagagaagacaacattagacagagtaagacaacagtagataccagttcatataattcaactcgaacgatttactcgtaacgacgaagcgaagtaagtagttgagactatggtagagaagacaacataaaaagaCTGTCTCTTATACACATCTCTGAGACTGCCAAGGCACACAGGCTGTCTCTTATACACATCTGGTATAGGNNNNNGNNNNNN" \
| cutadapt --quiet \
-g "PrefixNX=AGATGTGTATAAGAGACAG;max_error_rate=0.15;min_overlap=5" \
--revcomp \
-n 3 -

below are the trimed result. It's reverse-complement sequence. The 5' adapter still locates at terminal.

>r rc
tctttttatgttgtcttctctaccatagtctcaactacttacttcgcttcgtcgttacgagtaaatcgttcgagttgaattatatgaactggtatctactgttgtcttactctgtctaatgttgtcttctctactcgcgctactctctcaccgaatagtagtcacactacttctcgtatattcttatctctaactcgcgctactctctcaccgaatagtagtcacactacttctcgttttCTGTCTCTTATACACATCT

The following commond seems to work as expected.

echo -e ">r\nAGATGTGTATAAGAGACAGaaaacgagaagtagtgtgactactattcggtgagagagtagcgcgagttagagataagaatatacgagaagtagtgtgactactattcggtgagagagtagcgcgagtagagaagacaacattagacagagtaagacaacagtagataccagttcatataattcaactcgaacgatttactcgtaacgacgaagcgaagtaagtagttgagactatggtagagaagacaacataaaaagaCTGTCTCTTATACACATCTCTGAGACTGCCAAGGCACACAGGCTGTCTCTTATACACATCTGGTATAGGNNNNNGNNNNNN" \
| cutadapt --quiet \
-g "PrefixNX=AGATGTGTATAAGAGACAG;max_error_rate=0.15;min_overlap=5" \
-a "PrefixNX_comp=CTGTCTCTTATACACATCT;max_error_rate=0.3;min_overlap=5" \
-n 2 -
marcelm commented 8 months ago

The adapter occurs twice in the reverse complement but only once in the forward sequence, which is why Cutadapt prefers the reverse complemented read.

This is what Cutadapt does:

  1. Work on the read in the forward direction: a) First round: Adapter is found right at the start and removed. b) Second round: Adapter not found, so no further rounds are attempted.
  2. Reverse-complement the original read and work on that: a) First round: Adapter is found a little bit downstream (nnnnnncnnnnncctataccAGATGTGTATAAGAGACAGcc...) and removed. b) Second round: Adapter is found once again a little bit downstream and removed. c) Third round: Adapter not found.
  3. Compare total scores for forward vs reverse-complement: It’s higher for the reverse complement because there are two occurrences, so the trimmed reverse-complemented read is what is output.

I think what may have been unclear is that the rounds are done for both directions separately.

The way to solve this is to use your second command as you already figured out.

permia commented 8 months ago

Thanks. I figured it out.