marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

Trouble removing RevComp reads in the FWD.ForwardReads orientation and Forward reads in the FWD.ReverseReads orientation #723

Closed pennelli19 closed 1 year ago

pennelli19 commented 1 year ago

The data set I am using was sequenced by a professional service and I was given OTUs as the output. However, I am interested in using the fastq files for ASV identification. Since it was a service, the primers are considered proprietary and I am having a difficult time getting that information. I took a guess at common primers used for the V3-V4 region of the 16S gene and it seems like this might have been either the primer used or something very similar.

FWD = "CCTACGGGNGGCWGCAG" REV = "GACTACHVGGGTATCTAATCC"

The following was my original output:

primerHits = function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits = vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
             Forward Complement Reverse RevComp

FWD.ForwardReads 92019 0 0 190 FWD.ReverseReads 228 0 0 207 REV.ForwardReads 69 0 0 0 REV.ReverseReads 0 0 0 264308

After using the cutadapt package and the following code, the output now looks like the following:

cutadapt = "/Applications/miniconda3/envs/cutadapt/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R

path.cut = file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut = file.path(path.cut, basename(fnFs))
fnRs.cut = file.path(path.cut, basename(fnRs))

FWD.RC = dada2:::rc(FWD)
REV.RC = dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags = paste("-g", FWD, "-a", REV.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags = paste("-G", REV, "-A", FWD.RC) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 5, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}

Forward Complement Reverse RevComp FWD.ForwardReads 0 0 0 188 FWD.ReverseReads 226 0 0 0 REV.ForwardReads 0 0 0 0 REV.ReverseReads 0 0 0 0

Most of the primers were removed, but I can't really seem to remove the RevComp of the FWD.Forwad reads and the Reverse of the FWD.Reverse reads. Do you have any suggestions for removing the last primer sequences? I already tried increasing the "-n" to 5 as suggested in another solution, but this did not change the final output.

Thank you for the help!

marcelm commented 1 year ago

The following is an opinion born out of experience with other types of data, but I should note I don’t use DADA2.

I think you should not focus too much on getting all the numbers in that particular table to zero. You already got rid of ~99.9% of primer occurrences, which is really very good. There are always weird things going on in biological data and it is likely that some which are not measured by that table have a greater influence on your result than the 0.1% remaining primers (which occur in unexpected places anyway).

So IMO: It looks good, just continue with your analysis.

pennelli19 commented 1 year ago

Hi Marcel,

Thank you so much for the feedback. Unfortunately, once all of my sequences are merged, most of them are still being identified as chimeras. Given the uncertainty with the primer sequences, I don't think it is wise to use this method of analyzing the data. However, I was able to remove all the primers of the sequences I identified above and have included that code below in case it might help someone else in the future!

FWD = "CCTACGGGNGGCWGCAG" REV = "GACTACHVGGGTATCTAATCC"

primerHits = function(primer, fn) {
    # Counts number of reads in which the primer is found
    nhits = vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
    return(sum(nhits > 0))
}
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), FWD.ReverseReads = sapply(FWD.orients,
    primerHits, fn = fnRs.filtN[[1]]), REV.ForwardReads = sapply(REV.orients, primerHits,
    fn = fnFs.filtN[[1]]), REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))

Forward Complement Reverse RevComp FWD.ForwardReads 92019 0 0 190 FWD.ReverseReads 228 0 0 207 REV.ForwardReads 0 0 0 69 REV.ReverseReads 264308 0 0 0

cutadapt = "/Applications/miniconda3/envs/cutadapt/bin/cutadapt" # CHANGE ME to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R

path.cut = file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut = file.path(path.cut, basename(fnFs))
fnRs.cut = file.path(path.cut, basename(fnRs))

FWD.RC = dada2:::rc(FWD)
REV.RC = dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags = paste("-g", FWD, "-a", REV.RC, "-g", FWD.RC) 
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags = paste("-G", REV, "-A", FWD.RC, "-G", FWD) 
# Run Cutadapt
for(i in seq_along(fnFs)) {
  system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
                             "-o", fnFs.cut[i], "-p", fnRs.cut[i], "-m", 1,  # output files
                             fnFs.filtN[i], fnRs.filtN[i])) # input files
}
             Forward Complement Reverse RevComp

FWD.ForwardReads 0 0 0 0 FWD.ReverseReads 0 0 0 0 REV.ForwardReads 0 0 0 0 REV.ReverseReads 0 0 0 0

marcelm commented 1 year ago

Thanks for the feedback! Closing this issue then.