marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License
514 stars 129 forks source link

Add an option for maximum average error rate (--max-aer) #725

Closed rhpvorderman closed 1 year ago

rhpvorderman commented 1 year ago

I did some tests on GM24385_1.fastq.gz which is one of the nanopore GIAB samples. It has about 3 million records ranging from 3bp to over a million bp. --max-ee is unsuitable for this use case. A max-ee of 3 would be absolute garbage for the 3bp reads and stellar never seen before quality for the 1 million bp reads.

This pr adds a --max-average-error-rate/--max-aer option to alleviate this problem. Using a --max-aer 0.1 option for instance to filter out the very worst long read sequencing reads seems like a decent practice.

I will be off until coming tuesday, so I won't be able to respond as promptly as I usually do. I hope you will consider this PR as it will make using cutadapt for long-read sequencing much more viable.

rhpvorderman commented 1 year ago

All done. Technically this is my day off, but I wanted to get some coding done for fun, so here you go. I don't know if I will respond to any re-review comments until tuesday.

rhpvorderman commented 1 year ago

I am back. I spent my downtime on the weekend trying to accelerate the expected errors computation and I failed miserably. I did some exciting stuff using avx2 gather instructions but it didn't help much. In the end the only meaningful improvement I was able to generate was due to doing the phred boundscheck using sse2 instructions as well as some loop unrolling. The speedup was 10-15% though, not really worth all the extra hassle. The lookup is already quite fast, but puts a hard limit on the throughput. The fastest exp10 implementations also use lookups internally so there is nothing to be gained going that route. I'd love to have made a little companion PR speeding up expected_errors as well, but alas. Turns out it made a massive difference after all. I am quite delighted.

marcelm commented 1 year ago

Ok, thinking about this, I’m in favor of the feature, but at the moment somewhat reluctant to add yet another command-line option. I’m thinking this should somehow reuse the --max-ee option, just like --max-n can be a count or a fraction, but it’s hard to see how. One could use negative values: --max-ee=-0.5 would be the same as what --max-aer=0.5 is now. But that’s not very readable I admit.

Maybe you have a better idea? If not, I guess we’ll need to go with the extra option.

rhpvorderman commented 1 year ago

Ok, thinking about this, I’m in favor of the feature, but at the moment somewhat reluctant to add yet another command-line option.

Well it is new functionality. I do understand the reluctance. The truth is that cutadapt is very versatile and useful and an unmanageable amount of command line options is just a manifestation of that reality. See also GATK.

Maybe you have a better idea?

Well, I could make my own mean and lean tool that does just this and proliferate bioinformatics tools rather than command line options :wink:. in fact I already have a tool. I am going to retire that once this feature is merged. I think the bioinformatics scene will benefit from having less tools rather than more. So I vote for making a tool obsolete if the cost is an extra command line option. I think cutadapt can survive an extra command line option, but it is your decision.

rhpvorderman commented 1 year ago

Review comments have been addressed.

marcelm commented 1 year ago

Well, I could make my own mean and lean tool that does just this and proliferate bioinformatics tools rather than command line options 😉.

I meant "how can this become part of Cutadapt without adding another command-line option to it".

But it’s fine, let’s do this.

rhpvorderman commented 1 year ago

I meant "how can this become part of Cutadapt without adding another command-line option to it".

Yes, I know. I also couldn't think of any better options. I should have stated that explicitly. Sorry.

Thanks for merging!