Unable to trim primers from my interleaved fastq files

kdbchau commented 1 month ago

Hello. I'm having a problem trying to trim my primers from my interleaved fastq files (R1 and R2 reads have already been merged into each sample file).

I am using Windows Linux Bash Systems, cutadapt (version 2.8) was installed with sudo; no issues there. I am running Python3 version 3.8.10.

Example of a file name is: CONTROL_H12_Rep1_ITS1-F_M13F-ITS2_M13R.fastq. The ITS FWD and REV reads are together here.

The primers I use:

FWD = CTTGGTCATTTAGAGGAAGTAA REV = GCTGCGTTCTTCATCGATGC

Because I was following along with the DADA2 ITS pipeline in R, I had an output for my fastq files that the REV primer was detected as the complement of itself. Example of the output I got:

                 Forward Complement Reverse RevComp
FWD.ForwardReads      70          0       0       0
FWD.ReverseReads      70          0       0       0
REV.ForwardReads       0          0       0      68
REV.ReverseReads       0          0       0      68

As such, I changed REV to be the reverse complement, so:

REV = GCATCGATGAAGAACGCAGC

And then I also have the reverse complements for the primers:

FWD.RC = TTACTTCCTCTAAATGACCAAG
REV.RC = GCTGCGTTCTTCATCGATGC

And here is the code I used in terminal (because the cutadapt command within the dada2 R script wouldn't work in R for me):

for files in *.fastq; do cutadapt --interleaved -g CTTGGTCATTTAGAGGAAGTAA -a GCTGCGTTCTTCATCGATGC -G GCATCGATGAAGAACGCAGC -A TTACTTCCTCTAAATGACCAAG -n 2 -o output/${files%%.fastq}_cut.fastq $files; done

But I get an error message that the reads are improperly paired: cutadapt: error: Error in sequence file at unknown line: Reads are improperly paired. Name 'VH01754:2:AAFFFT2M5:1:1101:35538:7210 1:N:0:1' (first) does not match 'VH01754:2:AAFFFT2M5:1:1101:50365:12113 1:N:0:1' (second).

But I'm not sure how to fix this, because when I look at the names for the reads they appear only once and all end with "1:N:0:1" so it's almost like it had been converted into a single-end read file, but the sequencing was done as paired end. It's very confusing and I was wondering if you know how this may be remedied to ensure both primers are removed? I have tried treating this as single end but it is unclear if it is removing both the FWD and REV primer complements.

marcelm commented 1 month ago

R1 and R2 reads have already been merged into each sample file).

You say "merged", but also mention "interleaved", however, these are two different things, so I’m not sure which one you mean. Merging means to me that the two reads from a read pair are merged into a single-end read like this:

before:
R1  ---------------->
R2          <--------------------
after:
    ---------------------------->

On the other hand, "interleaved" just refers to a way to store paired-end reads in a single file. So while you usually have two files, one with the R1 reads and one with the R2 reads, storing paired-end reads in an interleaved manner means that you only have one file where you store the two reads one after the other so that the order of reads in the file is R1, R2, R1, R2 and so on.

Judging from the error message, I would guess that you have merged data, so the --interleaved option doesn’t apply, neither do the uppercase options -A and -G.

If it is merged data, you may want to trim your data with a linked adapter (see documentation) using -a ^forwardprimer...revcomp-of-reverse-primer (i.e., -a ^CTTGGTCATTTAGAGGAAGTAA...GCTGCGTTCTTCATCGATGC$).

I would also add option --discard-untrimmed so that all reads get discarded that don’t look as expected. The ^ at the beginning of the forward primer make the adapter anchored, which means that it must start at the beginning of the read.

If this does not work, please remove the ^ and $ from the command, re-run it, and paste the report ("length of removed sequences") here.

By the way, your Cutadapt version is quite old (2.8); I recommend using a more recent one.

kdbchau commented 4 weeks ago

Ok I ran it with cutadapt version 4.9 and I do finally see some output. This is my code and output:. I did have to remove the ^ and $ symbols for it to work, otherwise all my files were empty.

Code: for files in *.fastq; do /home/chauk/.local/bin/cutadapt -a CTTGGTCATTTAGAGGAAGTAA...GCTGCGTTCTTCATCGATGC -n 2 -o output2/${files%%.fastq}_cut.fastq $files; done

Output:

Overview of removed sequences at 3' end
length  count   expect  max.err error counts
5       7       46.9    0       7

Not sure if this makes sense, seems like only 7 sequences were removed although previously with the R tutorial it showed that at least 70 reads had a forward primer to it and 68 with reverse.

kdbchau commented 4 weeks ago

Ok I also did the sanity check but the reverse primers are not removed.

> # Sanity check, count the presence of primers in the first cutadapt-ed sample
> rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[1]]), 
+       FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = myfiles.cut[[1]]), 
+       REV.ForwardReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[1]]), 
+       REV.ReverseReads = sapply(REV.orients, primerHits, fn = myfiles.cut[[1]]))
                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0      68
REV.ReverseReads       0          0       0      68

So I will try this again because I think my reverse primer needs to be the reverse complement instead (since it's matching to RevComp). But At least the forward primers are removed!

kdbchau commented 4 weeks ago

Minor update, yes changing the REV primer to its REV.Comp worked and now my sanity check output is:

                 Forward Complement Reverse RevComp
FWD.ForwardReads       0          0       0       0
FWD.ReverseReads       0          0       0       0
REV.ForwardReads       0          0       0       0
REV.ReverseReads       0          0       0       0

marcelm / cutadapt

Unable to trim primers from my interleaved fastq files #800