marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 130 forks source link

trimgalore and cutadapt in two steps? #657

Closed marwa38 closed 1 year ago

marwa38 commented 1 year ago

Hi team I know that trimgalore is a wrapper around cutadapt but I want to check what I ran is considered fine I ran trimgalore to remove adaptors and primers at 5' and 3' ends but then

trim_galore -q 30 --phred33 --paired --gzip --fastqc --fastqc_args "--nogroup" -o ./trimGalored --length 20 --stringency 1 -e 0 --clip_R1 17 --clip_R2 21 *_S"$i"_L001_R1_001.fastq.gz *_S"$i"_L001_R2_001.fastq.gz

I found that I need to check for primers within the sequences themselves and I found and I removed them Things are ok that I did my analysis in that way? I don't want to reran the analysis again but I want to make sure that this is fine.

# Place filtered files in filtN/ subdirectory
fnFs.filtN <- file.path(path, "filtN", basename(fnFs)) # Put N-filterd files in filtN/ subdirectory
fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = TRUE)

#Check if primers are still inside (N, W and H nucleotides are identified by cutadapt)
FWD <- "CCTACGGGNGGCWGCAG"  
REV <- "GACTACHVGGGTATCTAATCC"

allOrients <- function(primer) {
  # Create all orientations of the input sequence
  require(Biostrings)
  dna <- DNAString(primer)  # The Biostrings works w/ DNAString objects rather than character vectors
  orients <- c(Forward = dna, Complement = complement(dna), Reverse = reverse(dna), 
               RevComp = reverseComplement(dna))
  return(sapply(orients, toString))  # Convert back to character vector
}
FWD.orients <- allOrients(FWD)
REV.orients <- allOrients(REV)
FWD.orients

# now ready to count the number of times the primers appear in the forward and reverse read, while considering all possible primer orientations. 
#Identifying and counting the primers on one set of paired end FASTQ files is sufficient, assuming all the files were created using the same library preparation, so we’ll just process the first sample.

primerHits <- function(primer, fn) {
  # Counts number of reads in which the primer is found
  nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
  return(sum(nhits > 0))
}
library (ShortRead)
rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]), 
      FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]), 
      REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]), 
      REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
#                   Forward Complement Reverse RevComp
# FWD.ForwardReads       0          0       0       0
# FWD.ReverseReads       0          0       0     112
# REV.ForwardReads       0          0       0       8
# REV.ReverseReads      99          0       0       0
marcelm commented 1 year ago

Unfortunately, I don’t know what the trim_galore command that you listed does under the hood. Can you perhaps talk to the TrimGalore developers about this?

marwa38 commented 1 year ago

@marcelm I asked as was kindly adviced the TrimGalore team https://github.com/FelixKrueger/TrimGalore/issues/147 Please share you comments regarding cutadapt part if any? that would be helpful.

Thanks in advance Marwa :)

marcelm commented 1 year ago

If you describe what you want to do, I can try to help. You would need to state which type of data you have, which adapters you want to remove and which other processing you think should be done.

marwa38 commented 1 year ago

@marcelm Thanks Marcel for your inputs in advance and for the discussion For my 16S rRNA microbiota dataset I want to achieve the following 1- remove the adaptors (Nexetra) which were successfully done using trimGalore above code I shared. 2- remove primers (the name of primers is V3-V4) at 5' and 3' ends (I used trimGalore params as shared in the above codes --clip_R1 17 --clip_R2 21 ) as you can see the number of base pairs in Forward primer is 17 and Reverse primer is 21

Forward: CCTACGGGNGGCWGCAG 
Reverse: GACTACHVGGGTATCTAATCC

3- remove the same primers (V3-V4; their specific sequences are attached above) if they are present within the sequenced data (I used the above cutadapt codes as the following step after trimGalore steps) Please let me know if need more info. Hope that made things clearer?

marcelm commented 1 year ago

Have a look at this section in the documentation: https://cutadapt.readthedocs.io/en/stable/recipes.html#trimming-amplicon-primers-from-paired-end-reads

If I googled correctly, the V3-V4 primers target a region that is about 460 bp long, so longer than your read length. That means that you can use the somewhat simpler first command suggested in that section. In your case:

cutadapt -j 8 -g ^CCTACGGGNGGCWGCAG -G ^GACTACHVGGGTATCTAATCC --discard-untrimmed -o out.1.fastq.gz -p out.2.fastq.gz in.1.fastq.gz in.2.fastq.gz

I am not 100% sure, but I think you don’t need to trim Nextera adapters at all. The adapters should only be part of the read when the sequenced fragment is shorter than the read length, but your fragments/amplicons should all be longer. You may have had some adapters in your data, but I would guess that these are from fragments that are not amplicons anyway. With the above command, any read pair that does not have one of the primer sequences is discarded, so the output should not contain any more adapters.

marcelm commented 1 year ago

Is it ok to close this issue?