Closed amy-houseman closed 1 year ago
What you see is not unexpected, because the options you used are not intended for removing N bases. Removing adapters with -a gets rid of adapters only, and the -q quality-trimming option trims only low-quality 3' ends. If there are low-quality bases (which the Ns probably are) in the middle of a read, quality trimming won’t catch them.
For the N bases at the 5' end, you could specify a second quality-trimming threshold, as in -q 20,30. The first value is then used for the 5' end and the second for the 3' end.
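To illustrate why -q leaves Ns in the middle of a read alone, and what a second threshold changes at the 5' end, here is a minimal Python sketch of a running-sum quality trimmer in the style described in the Cutadapt guide. It is a simplified re-implementation for illustration, not Cutadapt's own code. The function names and the example quality values are made up, and treating a single -q value as a 5' cutoff of 0 is an assumption.

```python
def _keep_from_right(qualities, cutoff):
    """Return how many leading positions survive trimming of the right-hand end.

    Walk from the right end, summing (quality - cutoff). Cut where the
    running sum is lowest; stop as soon as the sum becomes positive.
    """
    running = 0
    lowest = 0
    keep = len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break
        if running < lowest:
            lowest = running
            keep = i
    return keep


def quality_trim(qualities, cutoff_5, cutoff_3):
    """Return (start, stop) so that qualities[start:stop] is the kept part."""
    stop = _keep_from_right(qualities, cutoff_3)
    # The 5' end is handled as the mirror image of the 3' end.
    start = len(qualities) - _keep_from_right(qualities[::-1], cutoff_5)
    if start >= stop:
        return 0, 0  # nothing left of the read
    return start, stop


# Hypothetical read with one low-quality base (e.g. an N) in the middle.
middle_n = [38, 38, 38, 2, 38, 38, 38, 38]
# Hypothetical read with low-quality bases at both ends.
bad_ends = [2, 2, 38, 38, 38, 38, 38, 38, 12, 8]

print(quality_trim(middle_n, 0, 30))   # (0, 8): the middle base is kept
print(quality_trim(bad_ends, 0, 30))   # (0, 8): only the 3' end is trimmed
print(quality_trim(bad_ends, 20, 30))  # (2, 8): both ends are trimmed
```

Because the scan stops once the running sum turns positive, a single bad base surrounded by good ones is never reached from either end.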
For N bases that remain in the middle, you could use the --max-n option (see https://cutadapt.readthedocs.io/en/stable/guide.html#dealing-with-ns), but note that this discards the entire read pair if one of the reads has too many N bases. You would need to decide whether this is the right thing to do.
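The pair-level consequence of filtering on N count can be sketched in a few lines of Python. This only models the behaviour described above (a pair is dropped as soon as either read exceeds the limit). The function names and example sequences are hypothetical, and it is not how Cutadapt is implemented.

```python
def has_too_many_n(sequence, max_n):
    """True if the read contains more than max_n N bases."""
    return sequence.upper().count("N") > max_n


def keep_pair(read1, read2, max_n):
    """Keep the pair only if neither read exceeds the N limit."""
    return not (has_too_many_n(read1, max_n) or has_too_many_n(read2, max_n))


# Hypothetical read pairs as (forward, reverse) sequences.
pairs = [
    ("ACGTACGT", "TTGACCAT"),  # no N in either read
    ("ACGNACGT", "TTGACCAT"),  # one N in the forward read
    ("ACGTACGT", "TNGANCAT"),  # two Ns in the reverse read
]

for limit in (0, 1):
    kept = [pair for pair in pairs if keep_pair(pair[0], pair[1], limit)]
    print(limit, len(kept))
```

With a limit of 0, both pairs containing an N are lost even though their mates are clean, which is the trade-off to weigh before applying such a filter.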
Thank you lots for your reply!
My only problem is that I didn't spot this earlier and carried on with downstream processing. I used BWA-MEM for aligning. Do you think the data above would prove problematic, even though only those in the peak at 59 are above 5%?
Thank you lots! Amy
I think you’re probably fine. The Ns in the middle are unproblematic IMO. BWA-MEM will still be able to align the read, and any well-behaved tool you use afterwards will see that an N was aligned and will not give you a spurious variant call. The untrimmed 5' ends should not be a problem either, because BWA-MEM soft-clips ends with too many mismatches.
Thank you lots, especially for your fast reply! You have put my mind at ease. BW, Amy
Cutadapt v3.5 and Python version
Container
Command:
singularity exec $CONTAINER cutadapt -q 30 -m 35 -a TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG -A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -o $OUTPUTFASTQFORWARD -p $OUTPUTFASTQREVERSE $INPUTFASTQFORWARD $INPUTFASTQREVERSE > $TXTREPORT
If you report unexpected trimming behavior, this would also be helpful:
I have run a similar command on several samples, using their corresponding adapter sequences. According to FastQC and MultiQC, N bases are still being called in some of my samples. (I now know that the orange peak comes from 12 forward reads that were all sequenced on the same day, in the same lane, by the same sequencing company, and with the same exome capture kit.) But I am also concerned about the other peaks that remain.
Picture: before and after cutadapt, across the forward and reverse FASTQ files of 102 individuals.
Thanks! Amy