marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Tune the NextSeq quality trimming algorithm #297

kkanger opened this issue 6 years ago

kkanger commented 6 years ago

Hi, I seem to have the same problem as described in the closed issue #244 when using --nextseq-trim on paired-end data: poly-G tails in the reverse reads were not removed, although I specified the --nextseq-trim=20 option. My full command was:

cutadapt --nextseq-trim=20 --max-n 0 --minimum-length 35 -o trimmed/sample.R1_trimmed.fastq -p trimmed/sample.R2_trimmed.fastq sample.R1.fastq sample.R2.fastq

I used cutadapt v1.16 with Python 2.7.12 installed through pip.

marcelm commented 6 years ago

I tried this, but I’m not able to reproduce the problem on my test data. With issue #244 fixed, this shouldn’t happen.

As a test, can you swap the R1 and R2 input and output files and see what happens? Which of the reads get trimmed? R1, R2, both or none?

Which paired-end mode does cutadapt say that it uses? The fourth output line should look something like this:

Trimming 0 adapters with at most 10.0% errors in paired-end mode ...

or

Trimming 0 adapters with at most 10.0% errors in paired-end legacy mode ...

I would like to know whether it says "legacy" or not.

To continue with this, could you also provide a minimal dataset (just a single read pair) that exposes the problem? That would be very helpful.

marcelm commented 6 years ago

I realized just now that cutadapt doesn’t report quality-trimming statistics when --nextseq-trim is used. This is now fixed.

kkanger commented 6 years ago

Hey,

thanks for your reply!

Firstly, cutadapt runs in regular paired-end mode (not legacy).

Secondly, switching the forward and reverse reads had no effect on the output.

Thirdly, I have a suggestion about where the problem might stem from. Cutadapt efficiently removes the poly-Gs from the end of the read, but the poly-Gs at the beginning of the read remain in the output. When I analyze my FASTQ files with FastQC after running cutadapt, I see that the number of poly-G overrepresented sequences has decreased, but not to zero. I attach an example read pair with poly-Gs at the beginning of the reverse reads that were not trimmed.

Do you think there's a good way to get rid of the poly-Gs, or would it be easier to simply discard those reads? (It wouldn't change my dataset size too much.)

sample_polyG.zip

marcelm commented 6 years ago

Thanks for the test data. Yes, your observation is correct: Nextseq-specific quality trimming removes bases only from the 3' end. This is by design since the “dark cycles” (encoding G nucleotides) occur when the sequencer reaches the end of the DNA molecule, and because the sequencer proceeds towards the 3' end, only a 3' end (read suffix) should be affected.

The problem in the read pair you sent is that the reverse read has a high-quality T as second-to-last nucleotide. The quality-trimming algorithm proceeds from the 3' end towards the 5' end and will stop when it encounters that high-quality T. When you run your read pair through cutadapt with --nextseq-trim=20, you will see that it actually removes the very last G from the reverse read.
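
For illustration only (a simplified sketch, not cutadapt's exact implementation), this 3'-end scan can be thought of as BWA-style quality trimming in which every G is treated as low-quality, since a dark cycle is reported as a confident G:

```python
def nextseq_trim_index(bases, qualities, cutoff):
    """Return the index at which to cut off the 3' end (BWA-style running sum)."""
    s = 0              # running sum of (cutoff - quality)
    max_s = 0          # best running sum seen so far
    cut = len(bases)   # default: trim nothing
    for i in reversed(range(len(bases))):
        # Dark cycles are reported as G, so their quality value is not trusted
        # and is replaced with something just below the cutoff.
        q = cutoff - 1 if bases[i] == "G" else qualities[i]
        s += cutoff - q
        if s < 0:      # a sufficiently high-quality base: stop scanning
            break
        if s > max_s:
            max_s = s
            cut = i
    return cut

# Hypothetical reverse read like the one in this issue: a poly-G run with a
# single high-quality T as the second-to-last base (qualities are made up).
read = "GGGGGGGGGGTG"
quals = [2] * 10 + [35, 2]
print(read[:nextseq_trim_index(read, quals, cutoff=20)])  # GGGGGGGGGGT
```

In this made-up example the scan stops at the high-quality T, so only the very last G is removed, which matches the behaviour described above.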

I assume that the T is a sequencing error and that there was actually no DNA molecule at the spot on the flowcell.

At the moment, I would rather not change the algorithm to handle this special case, because I assume it is a problem in your dataset only. I would definitely need to see more examples, also from other datasets.

To proceed, you could possibly just ignore the problem. The affected reads will just not be mapped by the mapper.

If you do want to trim these reads, you can try adding the options -a G{20} -e 0.0 to the cutadapt command line, which makes it explicitly search for poly-Gs and trim them. This isn’t quite perfect because it searches for a stretch of 20 Gs anywhere in the read, but it should work most of the time.
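
For comparison only (this is not how cutadapt implements -a G{20}; cutadapt uses error-tolerant adapter alignment), a strict suffix-only poly-G strip could look like this:

```python
import re

def strip_polyg_suffix(read, min_run=20):
    """Remove a trailing run of at least min_run G characters (illustration only)."""
    return re.sub("G{%d,}$" % min_run, "", read)

print(strip_polyg_suffix("ACGTACGT" + "G" * 25))           # ACGTACGT
print(strip_polyg_suffix("ACGT" + "G" * 25 + "ACGT"))       # unchanged: the G run is not a suffix
```

The difference in the second call illustrates the caveat above: a 3' adapter match for G{20} can also hit an internal G run and trim everything after it.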

kkanger commented 6 years ago

Thanks for the explanation and suggestions! I'll try trimming the reads as you suggested :)

marcelm commented 6 years ago

You’re welcome! I’ll close this issue then. Feel free to open a new one if anything else comes up.

ChristianRohde commented 2 years ago

Hi, I am very sorry to open this old issue again, but I have a similar problem: about 6% of the reads in read 2 of my paired-end data are poly-G (reads containing only Gs!). Using --nextseq-trim=20 I can get this down to 0.13%, which is very good. How can I remove the rest? Should I run the -a G{20} -e 0.0 command in a pipe as a second cutadapt round, or is there a trick to do the magic in one go?

marcelm commented 2 years ago

I’ll re-open this as it appears that the --nextseq-trim algorithm needs to be tuned a little bit. Would you be willing to share one or two of your problematic sequences?

You should be able to just add -a G{20} -e 0 to the same command line. I think I would actually recommend using -a G{20} -e 1 to allow one error.

ChristianRohde commented 2 years ago

Right, I should give you some examples. I copied some reads that seem to be problematic into example input files; you'll find them attached in the ZIP. However, I am not sure anymore whether I want to exclude all of them. Things might be more complicated than I thought, but maybe the most obvious ones should be removed, which could be done with the command you showed above. I tried this, but I had some trouble adding the poly-G removal to cutadapt: I can get rid of these poly-G reads, but unfortunately it seems that more adapter is left over if I include it in the command. Please have a look at the initial FastQC report: [FastQC screenshot of the untrimmed input]

Next, the FastQC report after trimming (cutadapt -m 20 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -o out_R1.fastq.gz -p out_R2.fastq.gz input_R1.fq.gz input_R2.fq.gz): [FastQC screenshot after adapter and quality trimming]

And finally, the FastQC report after additionally adding the poly-G removal (cutadapt -m 20 --nextseq-trim=20 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT -a G{20} -A G{20} -e 1 -o out_R1.fastq.gz -p out_R2.fastq.gz input_R1.fq.gz input_R2.fq.gz): [FastQC screenshot after additionally adding poly-G removal]

I decided that for the moment we are better off running the normal cutadapt command.

Best, Christian

cutadapt_example_fastq.zip

marcelm commented 2 years ago

Thanks for the files. Modifying the algorithm will probably take some time, so no promises I can do this soon.