marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
514 stars 129 forks source link

Clarification of "Reads that were too long" #722

Closed HenriettaHolze closed 1 year ago

HenriettaHolze commented 1 year ago

Hi Marcel,

I don't understand the classification of reads as "too-long".
In the log file, 14% of my reads are too long but none are discarded as untrimmed.
When writing untrimmed and "too long" sequences to file, none are written to untrimmed.
The sequences written to "too long" don't contain any adapter sequence.
Thus, I don't understand why they are flagged as "too long".
Sequences are ultimately trimmed correctly but for QC purposes, I would like to understand the reasoning better.

Thanks a lot!

I'm using cutadapt v4.4 installed with conda.

The command I'm running:

cutadapt -j 10 -g "TGACCATGTACGATTGACTA;o=20" \
--untrimmed-output untrimmed.fastq \
--too-long-output too_long.fastq \
--max-n=0 -m 20 -e 0.1 -M 60 \
JK12-M10-BM-2_S52_R1_001.filtered.fastq.gz > JK12-M10-BM-2_S52_R1_001.trimmed.fastq 2> JK12-M10-BM-2_S52_R1_001.cutadapt_up.log

The beginning of the log file:

This is cutadapt 4.4 with Python 3.9.16
Command line parameters: -j 10 -g TGACCATGTACGATTGACTA;o=20 --trimmed-only --max-n=0 -m 20 -e 0.1 -M 60 JK12-M10-BM-2_S52_R1_001.filtered.fastq.gz
Processing single-end reads on 10 cores ...
Finished in 11.554 s (3.576 µs/read; 16.78 M reads/minute).

=== Summary ===

Total reads processed:               3,231,242
Reads with adapters:                 2,778,436 (86.0%)

== Read fate breakdown ==
Reads that were too short:                  52 (0.0%)
Reads that were too long:              452,806 (14.0%)
Reads with too many N:                     225 (0.0%)
Reads discarded as untrimmed:                0 (0.0%)
Reads written (passing filters):     2,778,159 (86.0%)

Total basepairs processed:   242,343,150 bp
Total written (filtered):    127,426,412 bp (52.6%)

=== Adapter 1 ===

Sequence: TGACCATGTACGATTGACTA; Type: regular 5'; Length: 20; Trimmed: 2778436 times

Minimum overlap: 20
No. of allowed errors:
1-9 bp: 0; 10-19 bp: 1; 20 bp: 2

Sequences that are written to too_long.fastq (sample)

@NB501056:977:H2NWYBGXV:1:11101:21648:1084 1:N:0:CCCAACCT+GTGTCTTA
ATGCGGATCCTGACCATGTACGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCAACCTATCTCGTATGC
+
AAAAAEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/AEEEEEEEE/AEAEAEAEEEAE/<E/AE/E
@NB501056:977:H2NWYBGXV:1:11101:17369:1088 1:N:0:CCCAACCT+GTGTCTTA
AATGCGTACTGACTAGGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCAACCTATCTCGTATGCCGTC
+
AAAAAEEEEEEEEEEEEEEE/EEEEEEE/EAEEEEEEE<AAAEEEEEEEEE/EEEE/EA<<////A////AE///
@NB501056:977:H2NWYBGXV:1:11101:22496:1218 1:N:0:CCCAACCT+GTGTCTTA
TATCGGATCCTGACCATGTACGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCAACCTATCTCGTATGC
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEE/EEEEAEEEEEEE6EAEAAEEEEEEEEE6EEEAE6//E//<//
@NB501056:977:H2NWYBGXV:1:11101:14511:1235 1:N:0:CCCAACCT+GTGTCTTA
AATGCGTACTGACTAGGTACTAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCAACCTATCTCGTATGCC
+
AAA6//AEEEAEEEEEE/EEEAEEE//EEEEEEEEEEEEEEEEAEEEEEEEEEAEA/A<E/6/AA//AAA///6E
@NB501056:977:H2NWYBGXV:1:11101:5061:1245 1:N:0:CCCAACCT+GTGTCTTA
CACGGATCCTGACCATGTACGCTAATGCGTACTGACTAGGCATAGATCGGAAGAGCACACGTCTGAACTCCAGTC
+
AAAAAEEEEEEEEEEEEAEEEEEEEEEEEE6EEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEAEEEEEEE
@NB501056:977:H2NWYBGXV:1:11101:6016:1247 1:N:0:CCCAACCT+GTGTCTTA
AATGCGTACTGACTAGGACAGATCGGAAGAGCACACGTCTGAACTCCAGTCACCCCAACCTATCTCGTATGCCGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAE</AEE/E//AAA

Cheers, Henrietta

marcelm commented 1 year ago

For each read, Cutadapt does two things:

  1. Modify the read: Apply all requested "read modifications" one by one. This includes adapter trimming, quality trimming, read renaming etc. No reads are removed at this stage.
  2. Filter the read: Check whether any of the requested filtering criteria apply and if so, discard the read (or redirect it to a different output file, as for example with --too-long-output).

The filtering criteria are always checked in a fixed order. (The order in which the command-line options are listed under the "Filtering of processed reads" heading in the cutadapt --help output).

The "Read fate breakdown" section in the log shows how many reads made it how far through those filtering steps:

== Read fate breakdown ==
Reads that were too short:                  52 (0.0%)
Reads that were too long:              452,806 (14.0%)
Reads with too many N:                     225 (0.0%)
Reads discarded as untrimmed:                0 (0.0%)
Reads written (passing filters):     2,778,159 (86.0%)

If a filtering criterion applies, the read is discarded and does not continue to the other filters. And this is what is happening here:

Your maximum length filtering criterion is -M 60, but also your input reads have a length of 75 bp. This means that an untrimmed read will always be filtered by the "too long" filter because that is applied before the trimmed-only filter.

HenriettaHolze commented 1 year ago

Thanks a lot for the clarification!