marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Why did CutAdapt not remove all my N's? #712

Closed amy-houseman closed 1 year ago

amy-houseman commented 1 year ago

singularity exec $CONTAINER cutadapt -q 30 -m 35 -a TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG -A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG -o $OUTPUTFASTQFORWARD -p $OUTPUTFASTQREVERSE $INPUTFASTQFORWARD $INPUTFASTQREVERSE > $TXTREPORT

I have run similar commands on several samples, each with its corresponding adapter sequences. According to FastQC and MultiQC, some of my samples still have N's being called. (I know that the orange peak comes from 12 forward reads that were all sequenced on the same day, in the same lane, by the same sequencing company, with the same exome capture kit.) I am also concerned about the other peaks that remain.

Picture before and after cutadapt across 102 individuals forward and reverse fastq files.

Picture 1

Thanks! Amy

marcelm commented 1 year ago

What you see is not unexpected because the options you use are not explicitly intended for removing N bases. Removing adapters with -a gets rid of adapters only and the -q quality-trimming option trims low-quality 3' ends. If there are low-quality bases (which the Ns probably are) in the middle, quality trimming won’t catch them.

For the N bases at the 5' end, you could specify two quality-trimming thresholds, as in -q 20,30. The first value is then used for the 5' end and the second for the 3' end.
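
As an illustration, here is a minimal sketch of the original command with two quality cutoffs instead of one. The function name `trim_both_ends` and its argument order are made up for this example, and the cutoffs 20 and 30 are just the values from the comment above, not a recommendation. The adapter sequences and `-m 35` are copied from the command in the question.

```shell
# Sketch only: the original paired-end command with a 5' and a 3' quality cutoff.
# trim_both_ends is a hypothetical helper name, not part of cutadapt.
trim_both_ends() {
    # $1, $2: input forward/reverse FASTQ
    # $3, $4: output forward/reverse FASTQ
    # -q 20,30: 20 is the cutoff for the 5' end, 30 the cutoff for the 3' end
    cutadapt -q 20,30 -m 35 \
        -a TCGTCGGCAGCGTCAGATGTGTATAAGAGACAG \
        -A GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAG \
        -o "$3" -p "$4" "$1" "$2"
}
```

It could then be called with the four file names and the report redirected to a file, as in the original command. Inside a container, the `cutadapt` call would need the same `singularity exec $CONTAINER` prefix as before.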

For N bases that remain in the middle, you could use the --max-n option, see https://cutadapt.readthedocs.io/en/stable/guide.html#dealing-with-ns, but note that this discards the entire read pair if one of the reads has too many N bases. You would need to decide whether this is the right thing to do.
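
To help decide whether discarding reads is acceptable, one option is to first count how many reads would exceed a given number of N bases. The following is a rough sketch using plain awk, not a cutadapt feature. The function name `count_reads_over_max_n` is made up for this example, and it assumes an uncompressed FASTQ file with exactly four lines per record.

```shell
# Sketch only: count reads with more than a given number of N bases.
# Assumes uncompressed 4-line FASTQ records (no wrapped sequence lines).
count_reads_over_max_n() {
    # $1: FASTQ file, $2: maximum allowed number of N bases per read
    awk -v max_n="$2" '
        NR % 4 == 2 {                 # second line of each record is the sequence
            total++
            n = gsub(/[Nn]/, "")      # gsub returns how many N characters it matched
            if (n > max_n) over++
        }
        END { printf "%d of %d reads exceed %d N bases\n", over, total, max_n }
    ' "$1"
}
```

Running it separately on the forward and reverse files gives a per-file estimate. Because the filter described above drops the whole pair when either read fails, the number of discarded pairs can be higher than either single-file count.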

amy-houseman commented 1 year ago

Thank you lots for your reply!

My only problem is that I didn't spot this before, so I carried on with downstream processing. I used BWA-MEM for aligning. Do you think the data above would prove problematic, given that only the peak at position 59 is above 5%?

Thank you lots! Amy

marcelm commented 1 year ago

I think you’re probably fine. The Ns in the middle are unproblematic IMO. BWA-MEM will still be able to align the read, and any well-behaved tool you use afterwards will see that an N was aligned and will not give you a spurious variant call or similar. The untrimmed 5' ends should not be a problem either because BWA-MEM soft-clips ends with too many mismatches.

amy-houseman commented 1 year ago

Thank you lots, especially for your fast reply! You have put my mind at ease. BW, Amy