marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

Primers remaining in the output files after demultiplexing #533

Closed cpavloud closed 1 week ago

cpavloud commented 3 years ago

Hello!

I am using cutadapt to demultiplex my reads (v2.10 with Python 3.7.6).

Basically, I have a pair of fastq files (e.g. input_R1_001.fastq and input_R2_001.fastq) which contain sequences from three genes (18S, COI and ITS) and I want to create a pair of fastq files for each gene.

Here is an example of the command I am using: `cutadapt --pair-adapters -g TGGTGCATGGCCGTTCTTAGT -a GGTCTGTGATGCCCTTAGATG -G CATCTAAGGGCATCACAGACC -A ACTAAGAACGGCCATGCACCA -o 18S_R1_001.fastq -p 18S_R2_001.fastq input_R1_001.fastq input_R2_001.fastq --untrimmed-output R1_untrimmed_18S.fastq --untrimmed-paired-output R2_untrimmed_18S.fastq

cutadapt --pair-adapters -g GGWACWGGWTGAACWGTWTAYCCYCC -a TGRTTYTTYGGNCAYCCNGARGTNTA -G TANACYTCNGGRTGNCCRAARAAYCA -A GGRGGRTAWACWGTTCAWCCWGTWCC -o COI_R1_001.fastq -p COI_R2_001.fastq input_R1_001.fastq input_R2_001.fastq --untrimmed-output R1_untrimmed_COI.fastq --untrimmed-paired-output R2_untrimmed_COI.fastq

cutadapt --pair-adapters -g CTTGGTCATTTAGAGGAAGTAA -a GCATCGATGAAGAACGCAGC -G GCTGCGTTCTTCATCGATGC -A TTACTTCCTCTAAATGACCAAG -o ITS_R1_001.fastq -p ITS_R2_001.fastq input_R1_001.fastq input_R2_001.fastq --untrimmed-output R1_untrimmed_ITS.fastq --untrimmed-paired-output R2_untrimmed_ITS.fastq `

The problem is that there are still primers found in the output files, so they are not completely trimmed/removed by cutadapt. Do you have any idea why this might be happening?

marcelm commented 3 years ago

For each of the commands that you listed, either the 5' adapter or the 3' adapter is removed from each read (depending on which one matches better) because Cutadapt by default removes only a single adapter from each read. (This holds even for paired-end reads: It will remove at most one from R1 and at most one from R2.)

You should probably use linked adapters if you want to remove a 5' and a 3' adapter at the same time.

So the first command would look something like this:

cutadapt \
  --pair-adapters \
  -a ^TGGTGCATGGCCGTTCTTAGT...GGTCTGTGATGCCCTTAGATG \
  -A ^CATCTAAGGGCATCACAGACC...ACTAAGAACGGCCATGCACCA \
  -o 18S_R1_001.fastq \
  -p 18S_R2_001.fastq \
  --untrimmed-output R1_untrimmed_18S.fastq \
  --untrimmed-paired-output R2_untrimmed_18S.fastq \
  input_R1_001.fastq \
  input_R2_001.fastq 

However, you can also use Cutadapt’s ability for demultiplexing to simplify the three commands into a single one (untested):

cutadapt \
  --pair-adapters \
  -a 18S=^TGGTGCATGGCCGTTCTTAGT...GGTCTGTGATGCCCTTAGATG \
  -A ^CATCTAAGGGCATCACAGACC...ACTAAGAACGGCCATGCACCA \
  -a COI=^GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTYGGNCAYCCNGARGTNTA \
  -A ^TANACYTCNGGRTGNCCRAARAAYCA...GGRGGRTAWACWGTTCAWCCWGTWCC \
  -a ITS=^CTTGGTCATTTAGAGGAAGTAA...GCATCGATGAAGAACGCAGC \
  -A ^GCTGCGTTCTTCATCGATGC...TTACTTCCTCTAAATGACCAAG \
  -o "output_{name}_R1.fastq" \
  -p "output_{name}_R2.fastq" \
  input_R1_001.fastq \
  input_R2_001.fastq

This will then create six output files output_18S_R1.fastq, output_COI_R1.fastq and so on and even output_unknown_R1.fastq and output_unknown_R2.fastq for the reads without a match.

naurasd commented 3 years ago

Hi @marcelm

thanks for your suggestions regarding this issue. Just a brief question:

When specifying for example -a COI=^GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTYGGNCAYCCNGARGTNTA \ for forward reads, I am wondering if this creates a new problem. If there is no read through during MiSeq paired-end 300bp sequencing because the amplicon is longer than 300bp, then the reverse complement of the reverse primer (the second sequence specified here) will not be found in the forward reads. Does that mean then that the read will be discarded or written to the unknown.fastq file, because both specified primer sequences need to be found in a read for being considered for pairing and trimming?

Thanks,

Nauras

marcelm commented 3 years ago

For exactly the reason you mention (because the read may not be short enough), the second specified sequence is actually optional. So whether a read is considered to be untrimmed or not is entirely determined by the 5' adapter only, but the 3' adapter will still be trimmed if found.

In slightly more detail: When you use -a to specify a linked adapter, a constituent adapter that is anchored is also considered to be required (and a non-anchored one is considered optional). The documentation has a bit more detail (but it may not be 100% clear on this): https://cutadapt.readthedocs.io/en/latest/guide.html#linked-adapters-combined-5-and-3-adapter

naurasd commented 3 years ago

thanks so much for the quick reply. had a look again at the documentation and the issue is actually described properly there.

thanks Nauras

peterjc commented 3 years ago

And on the other hand, if you wanted to require both the forward and reverse primers be present, you'd use -g instead of -a

naurasd commented 3 years ago

Thanks Peter, good to know.

naurasd commented 3 years ago

Just as a little hint when using the following demultiplexing command:

cutadapt \
  --pair-adapters \
  -a 18S=^TGGTGCATGGCCGTTCTTAGT...GGTCTGTGATGCCCTTAGATG \
  -A ^CATCTAAGGGCATCACAGACC...ACTAAGAACGGCCATGCACCA \
  -a COI=^GGWACWGGWTGAACWGTWTAYCCYCC...TGRTTYTTYGGNCAYCCNGARGTNTA \
  -A ^TANACYTCNGGRTGNCCRAARAAYCA...GGRGGRTAWACWGTTCAWCCWGTWCC \
  -a ITS=^CTTGGTCATTTAGAGGAAGTAA...GCATCGATGAAGAACGCAGC \
  -A ^GCTGCGTTCTTCATCGATGC...TTACTTCCTCTAAATGACCAAG \
  -o "output_{name}_R1.fastq" \
  -p "output_{name}_R2.fastq" \
  input_R1_001.fastq \
  input_R2_001.fastq

In our case, the demultiplexing resulted in quite a few reads containing sequences with a length of zero bp which created some problems for downstream analysis. Guess it would be a good idea to add

- m 1

to the command (or whatever length specification toher than 1) to discard zero bp reads.

Cheers Nauras

marcelm commented 1 week ago

This appears to have been resolved, closing.