marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 129 forks source link

paired-reads - anchored 5' primer not removed when adding trimming of 3' primer #625

Closed trippster08 closed 2 years ago

trippster08 commented 2 years ago

cutadapt 3.7 installed with conda and run through a conda environment python 3.9.13

I have short (~ 225bp) metabarcoding reads that were amplified using a pair of 12S primers which have had different sized spacers added for bp heterogeneity. Briefly, each forward and reverse primer consisted of a random mix of primers with 5 different lengths (no addition, 5 bp added, 6 bp added, 7 bp added, 8 bp added). However, because these were short amplicons, many read through and contain the reverse primer at the 3' end. When I use cutadapt with only anchored 5' primers, it removes all the forward primers successfully. However, when I add trimming for normal 3' primers (reverse complemented versions of the 12S reverse primers), and change nothing else, it no longer removes all 5' primers, and is also inconsistent in removing the 3' primers. Additionally, even though the 5' primers are considered anchored, it no longer discards the untrimmed primers.

cutadapt script for 5' anchored primer only (I also tried using --discard-untrimmed but this did not change my results at all): cutadapt \ -g file:MiFish_12SF_spacers.fas \ -G file:MiFish_12SR_spacers.fas \ -e 0.1 \ --untrimmed-output untrimmed_sequences/01-12S_untrimmed_noRC_R1.fastq.gz \ --untrimmed-paired-output untrimmed_sequences/01-12S_untrimmed_noRC_R2.fastq.gz \ --minimum-length 30 \ -o trimmed_sequences/01-12S_trimmed_noRC_R1.fastq.gz \ -p trimmed_sequences/01-12S_trimmed_noRC_R2.fastq.gz \ 01-12S_S1_L001_R1_001.fastq.gz \ 01-12S_S1_L001_R2_001.fastq.gz

cutadapt script for 5' anchored and 3' regular primer : cutadapt \ -g file:MiFish_12SF_spacers.fas \ -a file:MiFish_12SR_revcomp_spacers.fas \ -G file:MiFish_12SR_spacers.fas \ -A file:MiFish_12SF_revcomp_spacers.fas \ -e 0.1 \ --untrimmed-output untrimmed_sequences/01-12S_untrimmed_RC_R1.fastq.gz \ --untrimmed-paired-output untrimmed_sequences/01-12S_untrimmed_RC_R2.fastq.gz \ --minimum-length 30 \ -o trimmed_sequences/01-12S_trimmed_RC_R1.fastq.gz \ -p trimmed_sequences/01-12S_trimmed_RC_R2.fastq.gz \ 01-12S_S1_L001_R1_001.fastq.gz \ 01-12S_S1_L001_R2_001.fastq.gz

Primers called for R1: 12SF_spacers.fas: >iTru_MiFish_12S-F \^GTCGGTAAAACTCGTGCCAGC >iTruE_MiFish_12S-F \^AAGCGGTCGGTAAAACTCGTGCCAGC >iTruF_MiFish_12S-F \^GCCACAGTCGGTAAAACTCGTGCCAGC >iTruG_MiFish_12S-F \^CTGGATGGTCGGTAAAACTCGTGCCAGC >iTruH_MiFish_12S-F \^TGATTGACGTCGGTAAAACTCGTGCCAGC

12SR_revcomp_spacers.fas: >iTru_MiFish_12S-Rrc CAAACTGGGATTAGATACCCCACTATG >iTru5_MiFish_12S-Rrc CAAACTGGGATTAGATACCCCACTATGCCTAG >iTru6_MiFish_12S-Rrc CAAACTGGGATTAGATACCCCACTATGTAAGCA >iTru7_MiFish_12S-Rrc CAAACTGGGATTAGATACCCCACTATGACTTCGC >iTru8_MiFish_12S-Rrc CAAACTGGGATTAGATACCCCACTATGATAGGATT

I'm including one example read where attempting to trim the 5' and 3' primers resulted in only the 3' primer being removed, but the 5' primer remained. Example untrimmed R1 - this contains iTruE_MiFish-12S-F and iTru_MiFish_12S-Rrc AAGCGGTCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGCGGCCCTAGTTGATAATTACGGCGTAAAGGGTGGTTAAGGAACTCTACAATTAAAGCCAAACACTTCCCCAGCTGTCATACGCTCCTGGAAATAACGAAGCCCAAACGCGAAAGCAGCTTTATTTTACAAACCTGACCCCACGAAAGCTAAAACACAAACTGGGATTAGATACCCCACTATG Example trimmed 5' anchored only R1 - no longer has 5' primer, still has 3' primer, as expected CACCGCGGTTATACGAGCGGCCCTAGTTGATAATTACGGCGTAAAGGGTGGTTAAGGAACTCTACAATTAAAGCCAAACACTTCCCCAGCTGTCATACGCTCCTGGAAATAACGAAGCCCAAACGCGAAAGCAGCTTTATTTTACAAACCTGACCCCACGAAAGCTAAAACACAAACTGGGATTAGATACCCCACTATG Example trimmed 5' anchored and 3' R1 - still has 5' primer, no longer has 3' primer AAGCGGTCGGTAAAACTCGTGCCAGCCACCGCGGTTATACGAGCGGCCCTAGTTGATAATTACGGCGTAAAGGGTGGTTAAGGAACTCTACAATTAAAGCCAAACACTTCCCCAGCTGTCATACGCTCCTGGAAATAACGAAGCCCAAACGCGAAAGCAGCTTTATTTTACAAACCTGACCCCACGAAAGCTAAAACA

Summary trimmed 5' anchored only Sequence: AAGCGGTCGGTAAAACTCGTGCCAGC; Type: anchored 5'; Length: 26; Trimmed: 16735 times No. of allowed errors: 2 Overview of removed sequences length count expect max.err error counts 24 21 0.0 2 0 0 21 25 267 0.0 2 0 246 21 26 16337 0.0 2 16156 177 4 27 94 0.0 2 0 92 2 28 16 0.0 2 0 0 16

Summary trimmed 5' anchored and 3' R1 5' summary: Sequence: AAGCGGTCGGTAAAACTCGTGCCAGC; Type: anchored 5'; Length: 26; Trimmed: 5174 times No. of allowed errors: 2 Overview of removed sequences length count expect max.err error counts 24 4 0.0 2 0 0 4 25 72 0.0 2 0 65 7 26 5056 0.0 2 5000 55 1 27 26 0.0 2 0 24 2 28 16 0.0 2 0 0 16 3' summary: Sequence: CAAACTGGGATTAGATACCCCACTATG; Type: regular 3'; Length: 27; Trimmed: 3750 times Minimum overlap: 3 No. of allowed errors: 1-9 bp: 0; 10-19 bp: 1; 20-27 bp: 2 Bases preceding removed adapters: A: 94.3% C: 4.8% G: 0.3% T: 0.6% none/other: 0.0% WARNING: The adapter is preceded by 'A' extremely often. The provided adapter sequence could be incomplete at its 5' end. Ignore this warning when trimming primers. Overview of removed sequences length count expect max.err error counts 12 1 0.0 1 1 25 2 0.0 2 1 0 1 26 8 0.0 2 0 8 27 3725 0.0 2 3704 17 4 28 5 0.0 2 0 5 33 1 0.0 2 0 0 1 35 1 0.0 2 0 1 54 1 0.0 2 0 0 1 55 3 0.0 2 0 2 1 56 1 0.0 2 0 0 1 57 1 0.0 2 0 0 1 58 1 0.0 2 1

I'm sorry for such a long message. I feel like I'm missing some simple command, but I've gone over this so many times and read the user guide so often that I don't think I'm seeing it straight anymore. Thank you for your help.

marcelm commented 2 years ago

I’m traveling at the moment and would be happy to provide more info later. But briefly: Cutadapt by default removes at most one adapter per read. So looking at R1 only, if you provide -g and -a, then Cutadapt removes either the 5' adapter or the 3' adapter, depending on which one matches better. For your use case, using linked adapters may be appropriate. These combine a 5' and a 3' adapter.

trippster08 commented 2 years ago

Thank you for the quick response. I see I was missing something simple. I thought removing at most one adapter meant it would only remove 1 -g or 1 -a, not that it would remove either -g or -a, but that makes sense. I missed the paragraph about previously (before linked adapters) having to use --times. Thank you again for your help, and I'll try linked adapters. I've used these with single primer pairs, but never with combinations of multiple F and R primers.

trippster08 commented 2 years ago

I'm sorry for having more questions while you are traveling. Using linked adapters means having to put all possible permutations (25 in this case) of the 5' primer and 3' primer directly in the command line, correct? I typically use "file:" to specify multiple primers because it makes my scripts more adaptable for multiple users with differing primer sets, but "file:" seems to require a fasta file which the linked adapters format is not. Am I understanding this correctly? This is why I didn't use linked adapters for this analysis to start.

marcelm commented 2 years ago

thought removing at most one adapter meant it would only remove 1 -g or 1 -a,

Yes, that behavior would make more sense to be honest, but I’m reluctant to change it because it would definitely be backwards incompatible. I think I should add a command-line option to enable this, though.

You can put anythng you would write after the -a or -g on the command-line also into the FASTA file:

>mylinkedadapter1
^AACC...GGTT

It’s a bit annoying to specify all 5'/3' adapter combinations this way, so perhaps you should go with --times=2 instead (until I add that command-line option I mentioned). The only downside then would be that two 5' adapters or two 3' adapters get removed from a read.

Alternatively, run Cutadapt twice: Remove the 5' adapters in the first round, then the 3' adapters.

marcelm commented 2 years ago

I hope you got it working now. I’ve added issue #633 for the suggested command-line option to make Cutadapt search for one 5' and one 3' adapter in each read.