najoshi / sickle

Windowed Adaptive Trimming for fastq files using quality
MIT License

Difference in number of kept reads between runs with and without 5' trimming? #24

Open biocyberman opened 10 years ago

biocyberman commented 10 years ago

Hi, I tested the program on 3 libraries. The output is counter-intuitive to me.

With 5-prime trimming:

FastQ records kept: 48193327
FastQ records discarded: 51125213

FastQ records kept: 92263367
FastQ records discarded: 97344743

FastQ records kept: 146668253
FastQ records discarded: 154683227

No 5-prime trimming:

FastQ records kept: 48175971
FastQ records discarded: 51142569

FastQ records kept: 92226296
FastQ records discarded: 97381814

FastQ records kept: 146615764
FastQ records discarded: 154735716

You can see that running without 5-prime trimming keeps fewer reads than running with it. I wonder why that is.

biocyberman commented 10 years ago

My current finding: if I set the -x flag, sickle entirely drops any read whose first window at the 5' end is low quality. As a result, sickle drops more reads with the -x flag than without it. This could be a desirable feature or a bug. For colorspace sequences it is rather a feature, because we don't want to trim the 5' ends of sequences with prefix bases.
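To make the finding above concrete, here is a minimal toy model in Python. It is not sickle's source code: the function name `single_pass_trim`, its parameters, and the default values are all assumptions made for illustration. It assumes a single forward pass in which the 3' cut is placed at the first low-quality window seen after the 5' cut has been fixed, and that skipping 5' trimming simply fixes the 5' cut at position 0.

```python
def single_pass_trim(quals, window=3, threshold=20, min_len=4, no_fiveprime=False):
    """Toy single-pass window trimmer (hypothetical, not sickle's code).

    Returns (start, end) of the kept region, or None if the read is discarded.
    """
    n = len(quals)
    # Assumption: with 5' trimming disabled, the 5' cut is fixed at 0,
    # so the search for the 3' cut begins at the very first window.
    start = 0 if no_fiveprime else None
    end = n
    for i in range(max(n - window + 1, 0)):
        avg = sum(quals[i:i + window]) / window
        if start is None:
            # Still looking for the first good window (the 5' cut).
            if avg >= threshold:
                start = i
            continue
        if avg < threshold:
            # First bad window after the 5' cut becomes the 3' cut.
            end = i
            break
    if start is None or end - start < min_len:
        return None  # nothing usable left: read is discarded
    return start, end


# A read with a poor 5' end followed by good bases.
read = [2, 2, 2, 30, 30, 30, 30, 30, 30, 30]
print(single_pass_trim(read))                     # (2, 10)
print(single_pass_trim(read, no_fiveprime=True))  # None
```

Under these assumptions, a read with a low-quality first window is kept when 5' trimming is on but discarded when it is off, because the 3' cut lands at position 0. That would be consistent with the slightly lower "kept" counts reported above.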

However, to handle the -x flag properly, I think it would be better if sliding_windows worked in both directions: from the 5' end to find the 5' cut, and from the 3' end to find the 3' cut. This would slow the program down a bit, but it would be more robust and easier to understand.
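A rough sketch of what such a two-direction scan could look like, again as a toy Python model rather than a patch to sickle. The name `two_pass_trim`, its parameters, and the default values are hypothetical. The 5' cut is the start of the first window (scanning forward) whose mean quality meets the threshold; the 3' cut is the end of the last such window (scanning backward). When 5' trimming is disabled, the 5' cut is simply left at 0 and the backward scan is unaffected.

```python
def two_pass_trim(quals, window=3, threshold=20, min_len=4, no_fiveprime=False):
    """Toy two-direction window trimmer (hypothetical, not sickle's code).

    Returns (five_prime_cut, three_prime_cut), or None if the read is discarded.
    """
    n = len(quals)
    starts = range(max(n - window + 1, 0))

    def window_ok(i):
        return sum(quals[i:i + window]) / window >= threshold

    # Forward scan for the 5' cut, skipped when 5' trimming is disabled.
    five = 0
    if not no_fiveprime:
        five = next((i for i in starts if window_ok(i)), None)
        if five is None:
            return None  # no good window anywhere

    # Backward scan for the 3' cut, independent of the 5' result.
    last_good = next((i for i in reversed(starts) if window_ok(i)), None)
    if last_good is None:
        return None
    three = last_good + window

    if three - five < min_len:
        return None
    return five, three


read = [2, 2, 2, 30, 30, 30, 30, 30, 30, 30]
print(two_pass_trim(read))                     # (2, 10)
print(two_pass_trim(read, no_fiveprime=True))  # (0, 10)
```

One design consequence to be aware of: because the 3' cut is found from the end, a low-quality dip in the middle of a read is kept rather than truncating the read there, which differs from a single forward pass.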

najoshi commented 10 years ago

Hello,

Sickle wasn't written with colorspace reads in mind, so I can't really say why it would behave strangely with those reads. If you create a robust solution, maybe we could integrate it into sickle.

On Thu, May 8, 2014 at 3:23 AM, biocyberman notifications@github.com wrote:

My current finding: if I set the -x flag, sickle entirely drops any read whose first window at the 5' end is low quality. As a result, sickle drops more reads with the -x flag than without it. This could be a desirable feature or a bug. For colorspace sequences it is rather a feature, because we don't want to trim the 5' ends of sequences with prefix bases.


Nikhil Joshi Bioinformatics Analyst/Programmer UC Davis Bioinformatics Core http://bioinformatics.ucdavis.edu/ najoshi -at- ucdavis -dot- edu 530.752.2698 (w)