marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

Primer sequence trimming. Cutadapt does not cut all of them #798

Open Guillermouceda opened 3 months ago

Guillermouceda commented 3 months ago

Hello,

I am relatively new to Cutadapt and therefore this question might have been asked before or it may seem stupid.

I am using Cutadapt to trim the primer sequences from several metabarcoding libraries produced with Illumina MiSeq 600 v3. The sequences are already demultiplexed and split into the original samples. I am running Cutadapt with the following settings:

My libraries consist of sequences coming from pollen collected from bees. I have used the following primers:

I have 175 libraries and therefore I have 350 pair-end files (R1 and R2 files). Thus, I have run Cutadapt in loop like:

for i in *_R1.fastq.gz
do
      FILENAME=`basename $i _R1.fastq.gz`;
      echo $FILENAME;
      cutadapt -g ATGCGATACTTGGTGTGAAT -G GACGCTTCTCCAGACTACAAT \
      -m 100 -M 350 \
      --match-read-wildcards \
      --pair-filter=both -q 10 \
      -o ${FILENAME}_R1_trimmed.fastq.gz -p ${FILENAME}_R2_trimmed.fastq.gz\
      ${FILENAME}_R1.fastq.gz ${FILENAME}_R2.fastq.gz
      done

Cutadapt runs smoothly and I do not get any issues. However, I still find some of my primer sequences after counting them with:

zgrep -c ATGCGATACTTGGTGTGAAT *_R1_trimmed.fastq.gz >> primers_Fw_in_reads_r1_trimmed.txt zgrep -c GACGCTTCTCCAGACTACAAT *_R2_trimmed.fastq.gz >> primers_Rev_in_reads_r2_trimmed.txt zgrep -c ATGCGATACTTGGTGTGAAT *_R2_trimmed.fastq.gz >> primers_fw_in_reads_r2_trimmed.txt zgrep -c GACGCTTCTCCAGACTACAAT *_R1_trimmed.fastq.gz >> primers_rev_in_reads_r1_trimmed.txt

In the forward read (R1) I still find a few Fwd primer sequences, as well as I find a few Rev primer sequences in the R2 reads. Is this normal? I am talking about a few 100s in some samples. But what is most striking is that all the Fwd primer sequences in the R2 files are still there, and the same happens for the Rev primer sequences in the R1 reads. Is there something that I am not understanding and this is normal?**Is it OK to run again Cutadapt over the output of the first run with the same settings?**

I provide the .txt files were the count of each primer sequences was done for each library before/after running Cutadapt and the read count of each library.

Read counts per library file: read.counts.txt

Primer sequence count before Cutadapt: primers_fw_in_reads_r2.txt primers_rev_in_reads_r1.txt primers_Rev_in_reads_r2.txt primers_Fw_in_reads_r1.txt

Primer sequence count after Cutadapt: primers_fw_in_reads_r2_trimmed.txt primers_rev_in_reads_r1_trimmed.txt primers_Rev_in_reads_r2_trimmed.txt primers_Fw_in_reads_r1_trimmed.txt

Thank you very much in advance.

marcelm commented 3 months ago

In the forward read (R1) I still find a few Fwd primer sequences, as well as I find a few Rev primer sequences in the R2 reads. Is this normal? I am talking about a few 100s in some samples.

This is probably normal. I believe the extra occurrences are from reads that contain two primers. Cutadapt by default trims the leftmost one, so additional occurrences remain in the read.

But what is most striking is that all the Fwd primer sequences in the R2 files are still there, and the same happens for the Rev primer sequences in the R1 reads. Is there something that I am not understanding and this is normal?

This is expected: The sequence that you provide with the (lowercase) -g option is searched only in the R1 reads, and the sequence that you provided with the (uppercase) -G option is searched only in the R2 reads.

If you want to search the respective other read as well, you need to use the --revcomp option. Use this only if you are sure that you can have inserts in both directions.

Is it OK to run again Cutadapt over the output of the first run with the same settings?

It does not make that much sense to me. I have not worked with metabarcoding data, so I’m not entirely sure how your data is actually structured, but I would probably run Cutadapt in a slightly different way. For both the -g and the -G options, I would use anchored adapters, that is, write them like this: -g ^ATGCGATACTTGGTGTGAAT -G ^GACGCTTCTCCAGACTACAAT because I would expect them to be exactly at the 5' end of the read. With the ^ character, you allow partial matches and also matches within the read. Also, I would probably use --discard-untrimmed in order to ensure I only get the reads that actually start with a primer.

If you do that, you don’t need the -m 100 -M 350 --pair-filter=both options. Also, using the --match-read-wildcards option is a bit unusual, you may want to leave it out unless you have a reason for needing it.