marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Optimizing filtering when a MINSIZE is given by user #149

Closed glihm closed 9 years ago

glihm commented 9 years ago

Hi Marcel,

I am a new user of cutadapt, and you've written a really great Python program!

I haven't read the whole code, and I was wondering whether an optimization like the one I describe below is implemented in cutadapt.

For example, when the user specifies a MINSIZE greater than (READSIZE - ADAPTERSIZE), any full match (for instance, when `adapter.sequence in read.sequence` returns true) guarantees that the trimmed read will be shorter than MINSIZE. In that case there is no need to run the whole algorithm: the read can be discarded directly.
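The condition above can be sketched as follows (a minimal illustration with hypothetical names; not cutadapt's actual code). If the adapter matches exactly at position `pos`, trimming leaves `pos` bases, and `pos` can never exceed READSIZE - ADAPTERSIZE, so whenever MINSIZE is larger than that difference, every exact hit means the read falls below the minimum:

```python
def can_discard_early(read_seq: str, adapter_seq: str, min_length: int) -> bool:
    """Shortcut: on an exact adapter hit, decide immediately whether the
    trimmed read would be shorter than min_length and can be discarded."""
    pos = read_seq.find(adapter_seq)
    if pos == -1:
        return False  # no exact match; fall back to the full algorithm
    # Trimming at the exact match leaves `pos` bases before the adapter.
    return pos < min_length

# Example: 8-base read, 4-base adapter, minimum length 5.
# The adapter starts at position 4, so trimming leaves 4 bases (< 5).
print(can_discard_early("ACGTTAGC", "TAGC", 5))  # True
```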

I mention this because in applications such as ribosome profiling this is exactly the situation, and many reads (15-20%) are discarded for this reason. So I wrote a script to remove them before running cutadapt. However, it could be included in your program (if it is not already implemented! :D In that case, I apologize), because I think other RNA-Seq applications could also use this kind of optimization to save computation time. :)

marcelm commented 9 years ago

Hi, thanks for your suggestion. Cutadapt does have an optimization where it first searches for an exact, full-length match in the read. If that was found, then it does not run the error-tolerant matching algorithm. I’m not quite sure whether this is what you meant because this optimization applies in all situations, independent of the minimum length parameter. Please tell me if you meant something else.
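The strategy described above, trying a fast exact search before the error-tolerant algorithm, can be sketched like this (a simplified illustration, not cutadapt's implementation; `error_tolerant_find` is a naive stand-in for a real alignment routine):

```python
def error_tolerant_find(read_seq: str, adapter_seq: str, max_mismatches: int = 1) -> int:
    """Naive mismatch-tolerant search: slide the adapter along the read and
    count mismatches at each offset. Returns the start position or -1."""
    for start in range(len(read_seq) - len(adapter_seq) + 1):
        window = read_seq[start:start + len(adapter_seq)]
        mismatches = sum(a != b for a, b in zip(window, adapter_seq))
        if mismatches <= max_mismatches:
            return start
    return -1

def locate_adapter(read_seq: str, adapter_seq: str) -> int:
    """Try a fast exact substring search first; only run the slower
    error-tolerant search when the exact search finds nothing."""
    pos = read_seq.find(adapter_seq)  # fast exact search
    if pos != -1:
        return pos  # exact hit: skip the error-tolerant scan entirely
    return error_tolerant_find(read_seq, adapter_seq)
```

Since most reads in practice contain the adapter either exactly or not at all, the cheap `str.find` call handles the common case and the expensive scan runs only when needed.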

Do you get a speed improvement with your preprocessing script?

glihm commented 9 years ago

Hi,

thank you for translating my idea. Yes, your optimization covers the one I had in mind. I posted it because in some applications such as ribosome profiling, a full match means the read is certain to be discarded, so no further tests need to be run on that read. I was thinking about a way to skip all other tests on such reads to save time.

In my case, I do see a speed improvement: in one sample of 200 million reads, my script removes 25 million. Cutadapt then runs faster simply because it has fewer reads to process.

I hope the idea is clear; perhaps it is already implemented as an optimization in your code.

marcelm commented 9 years ago

Ok, then this optimization is already in cutadapt. I'm sure that cutadapt is faster if it doesn't have to process as many reads, but I am wondering whether you actually save time in the end, since your preprocessing script also needs some time to run.

glihm commented 9 years ago

All right. Thank you for your answer and for this good program. :)