marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
523 stars 129 forks source link

Unexpected error counts of 5'- anchored adapter in a linked adapter #615

Closed ForestCat1 closed 2 years ago

ForestCat1 commented 2 years ago

Hi, Marcel, thanks for developing this efficient tool! I'm using Cutadapt 3.7 (installed by conda) with Python 3.8.12, since the structure of my library is

5'-P5ADAPTER-RECsite1-xxx-RECsite2-P7ADAPTER-3' 3'-P5ADAPTER-RECsite1-xxx-RECsite2-P7ADAPTER-5' (sequenced on Illumina NovaSeq, paired-end 150 bp, REC=restriction enzyme cutting, 6 bp for each RECsite)

I ran Cutadapt with following parameters to trim 3'- low quality bases, remove the 5'- RECsite and 3'-RECsite+ADAPTER:

--no-indels --nextseq-trim=10 -O 9 (I have noticed the default setting of -O is 3, and -O has no effect on anchored adapter, so I added the fixed length of RECsite) -a ^RECsite1...RECsite2AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A ^RECsite2...RECsite1AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o /path/out1.fq.gz -p /path/out2.fq.gz -m 30 /path/Read1.fq.gz /path/Read2.fq.gz

Then I got this report: Processing reads on 1 core in paired-end mode ... === First read: Adapter 1 ===

Sequence: RECsite1...RECsite2AGATCGGAAGAGCACACGTCTGAACTCCAGTCA; Type: linked; Length: 6+39; 5' trimmed: 32999357 times; 3' trimmed: 602797 times

Minimum overlap: 6+9 No. of allowed errors: 0

No. of allowed errors: 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3

Overview of removed sequences at 5' end length count expect max.err error counts 6 32999357 8110.2 0 32981415 14556 2551 835

Overview of removed sequences at 3' end (omitted) === Second read: Adapter 2 ===

Sequence: RECsite2...RECsite1AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT; Type: linked; Length: 6+39; 5' trimmed: 32912355 times; 3' trimmed: 613241 times

Minimum overlap: 6+9 No. of allowed errors: 0

No. of allowed errors: 1-9 bp: 0; 10-19 bp: 1; 20-29 bp: 2; 30-39 bp: 3

Overview of removed sequences at 5' end length count expect max.err error counts 6 32912355 8110.2 0 32888926 20516 2388 525

Overview of removed sequences at 3' end (omitted)

I wonder why the removed sequences at 5' end can have non-zero error counts, as the max.err for 5'-anchored RECsite is 0.

By the way, as a "by-product", the polyGs (reads only containing G) reported by fastqc (overrepresented sequence) seemed to be effectively removed since they do not have anchored 5'-RECsite, but actually I have not do 5'- quality trimming this time.

I'm not sure whether the "--nextseq-trim=" can be used as "-q" when I want to perform both 5'- and 3'- quality trimming with NovaSeq data (e.g., --nextseq-trim=15,10). If I can't specify 2 numbers in "--nextseq-trim", will "-q 15,0 --nextseq-trim=10" work well? Because I want to perform 5'- and 3'- quality trimming in one Cutadapt run.

Yours, sincerely

marcelm commented 2 years ago

Thanks for reporting. It turns out that this was a bug. The reported error counts actually show the total number of errors (errors in the 5' adapter plus errors in the 3' adapter). This was of course not the intention and I have fixed it now.

Regarding --nextseq-trim: This option only accepts a single threshold because trimming the 5' end with the algorithm specific to dark-cycles does not make too much sense. To trim the 5' end with the regular algorithm and the 3' end with the NextSeq algorithm, you can indeed use --nextseq-trim=10 -q 15,0, but testing it just now, I see that the reported statistics will be incorrect here as well. I’ve opened #616 for that.

marcelm commented 2 years ago

Just writing to let you know that replying to a closed issue will not re-open it.