marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
502 stars 126 forks source link

NextSeq Quality Trimming Issue #758

Closed gfill88 closed 4 months ago

gfill88 commented 5 months ago

Hello,

I am using cutadapt to trim barcodes and filter reads (based on length and quality) from NextSeq1000 data. I am using cutadapt v4.5, which is currently installed as a conda environment. Below is the command line that I used:

cutadapt -e 0 --no-indels --pair-adapters --nextseq-trim=20 --minimum-length 40 \
-g ^file:cutadaptfmt.fasta \
-G ^file:cutadaptfmt.fasta \
-o {name}.1.fastq -p {name}.2.fastq \
TEST_S1_R1_001.fastq TEST_S1_R2_001.fastq

The output is attached with the summary posted below:

=== Summary ===

Total read pairs processed:        102,400,773
  Read 1 with adapter:              85,598,549 (83.6%)
  Read 2 with adapter:              85,598,549 (83.6%)

== Read fate breakdown ==
Pairs that were too short:             125,083 (0.1%)
Pairs discarded as untrimmed:                0 (0.0%)
Pairs written (passing filters):   102,275,690 (99.9%)

Total basepairs processed: 30,925,033,446 bp
  Read 1: 15,462,516,723 bp
  Read 2: 15,462,516,723 bp
**Quality-trimmed:           1,658,317,748 bp (5.4%)
  Read 1:   319,675,408 bp
  Read 2: 1,338,642,340 bp**
Total written (filtered):  27,877,897,130 bp (90.1%)
  Read 1: 14,442,133,932 bp
  Read 2: 13,435,763,198 bp

It looks like some reads were quality trimmed. When I process the trimmed datasets using fastqc, I see that the blue mean value is well above my Q20 threshold, however, the lower whiskers extend below Q20, which I wasn't expecting. See attached doc with fastqc per base sequence quality plots. This might be a simple misinterpretation on my end but wanted to ask. Thanks in advance!!!

Fastqc_charts.docx

cutadapt_output_summary.txt

marcelm commented 5 months ago

Hi, I think the report looks fine.

First, note that the FastQC chart is somewhat misleading when you run FastQC on reads that have variable lengths – such as those that you get after quality trimming or adapter removal. The reason is that you cannot see how many bases each of the boxplots is based on. For example, the statistics for lengths 140-143 are very likely based on a lower number of reads.

Second, the quality trimming algorithm that Cutadapt uses sometimes does not trim as much as a human would when looking at the quality values. In particular, when the last (most 3') base of the read has a quality value above the threshold, the read is not trimmed at all no matter how low the quality values are that come before it. You may see some of these cases in that chart.

I think the whisker going below the trimming threshold can be explained by the box plots showing merged statistics for multiple read lengths. For example, if one read of length 143 has a high-quality base at the end that is preceded by a low-quality one, it doesn’t get trimmed, but both the low- and high-quality base end up in the same boxplot for lengths 140-143.

gfill88 commented 4 months ago

Thank you so much for this feedback and explanation! It makes sense that the lower whisker would fall below the trimming threshold when a there are a variety of read lengths. Thanks for making the time to clarify this for me! Very much appreciated.

marcelm commented 4 months ago

Sure, no problem! Closing this now, feel free to comment or open a new issue if there are further questions.