marcelm / cutadapt

Cutadapt removes adapter sequences from sequencing reads
https://cutadapt.readthedocs.io
MIT License

Filtering reads based on low-quality base ratio #699

Open gilstel opened 1 year ago

gilstel commented 1 year ago

Is there a way to filter reads based on low-quality base ratio of the read? Many thanks, Gil

rhpvorderman commented 1 year ago

Would --max-expected-errors be sufficient for your use case? https://cutadapt.readthedocs.io/en/stable/guide.html#filtering @marcelm I sped up the underlying calculation a bit: #700

I personally feel a --max-error-rate flag would be very helpful, as it would also work for variable length reads. Currently I have this implemented in fastq-filter but I'd rather use cutadapt than add another tool to the pipeline.

marcelm commented 1 year ago

I personally feel a --max-error-rate flag would be very helpful, as it would also work for variable length reads.

Is that the -e option in fastq-filter? That would be the number of expected errors (as computed for --max-expected-errors) divided by the read length, IIUC. I think I’d be fine with adding that.

I read https://www.drive5.com/usearch/manual/avgq.html back when I added --max-expected-errors; that appears to be the same motivation for -e in fastq-filter.

rhpvorderman commented 1 year ago

Is that the -e option in fastq-filter? That would be the number of expected errors (as computed by --max-error-rate) divided by the read length IIUC.

That is exactly it. It is a bit more generic as it does not care about the length and therefore works well on variable-length sequences.
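The length-independence can be sketched in a few lines of Python (my reading of the proposed metric, not actual Cutadapt or fastq-filter code):

```python
def expected_errors(qualities):
    """Sum of per-base error probabilities from Phred quality scores."""
    return sum(10 ** (-q / 10) for q in qualities)

def expected_error_rate(qualities):
    """Expected errors divided by read length (the proposed metric)."""
    return expected_errors(qualities) / len(qualities)

# A short and a long read at uniform Q20 have different expected-error
# counts but the same rate, so one threshold works for both:
short_read = [20] * 50    # 0.5 expected errors, rate 0.01
long_read = [20] * 500    # 5.0 expected errors, rate 0.01
```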

The major mistake I made in fastq-filter is that there is also a median filter, which is quite terrible because it is not an informative metric. There is also a Q-score filter that correctly takes the average (it converts the scores to error rates internally), but I wish I hadn't added that either, because Q-scores are massively confusing: Q10 vs Q20 vs Q30 are massive differences, and the Q-scores simply don't convey that. Error rate is much better, as it does what it says on the tin.

It was a good exercise in pedantic performance optimization though, the lookup table is really fast.

I read https://www.drive5.com/usearch/manual/avgq.html back when I added --max-expected-errors; that appears to be the same motivation for -e in fastq-filter.

Yes, exactly. I got it from https://gigabaseorgigabyte.wordpress.com/2017/06/26/averaging-basecall-quality-scores-the-right-way/. Once seen, this can't be unseen. I hate that almost every tool (FastQC, FastP) does this the wrong way, and even after being informed, they haven't fixed it.
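The effect described in the linked post can be shown with two toy quality scores (illustrative values only, not taken from any real tool):

```python
import math

def naive_mean_q(qualities):
    """Arithmetic mean of Phred scores -- the common but wrong average."""
    return sum(qualities) / len(qualities)

def mean_q_via_error_rate(qualities):
    """Average in error-probability space, then convert back to Phred."""
    mean_p = sum(10 ** (-q / 10) for q in qualities) / len(qualities)
    return -10 * math.log10(mean_p)

quals = [10, 30]                       # error probabilities 0.1 and 0.001
naive = naive_mean_q(quals)            # 20.0, implying p = 0.01
proper = mean_q_via_error_rate(quals)  # ~12.97: the true mean p is 0.0505
```

The naive average overstates the quality by a factor of five in error-probability terms, because the high-error base dominates the true mean.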

gilstel commented 1 year ago

Hi

I tried to read the information in the links you provided (e.g. on the --max-expected-errors option), but I am still confused about how to actually use it (which threshold to choose). If, for example, I would like to filter out any reads whose low-quality base ratio (base quality less than or equal to 5) is more than 50%, what is the best way to perform this filtering step?

Many thanks, Gil

marcelm commented 1 year ago

If, for example I would like to filter out any reads whose low-quality base ratio (base quality less than or equal to 5) is more than 50%.

I am not familiar with this ratio. Is this something that is used in other tools? In any case, it is not directly supported in Cutadapt.

You can choose the --max-expected-errors threshold such that it is at least as strict as your filtering criterion, but it will also discard some reads that your filter would keep.

Let’s assume you have a read length of $n=100$ bp. If I understand correctly, using your filter, the read will be discarded if it has more than 50 bases of quality 5. Let’s say it has exactly 50 bases of quality 5 and 50 bases of quality infinity (so we can ignore their error probabilities). A base of quality 5 has an error probability of $10^{-\frac{5}{10}}\approx0.316$. Since the number of expected errors for a read is the sum of the error probabilities, we get $50\times0.316=15.8$ expected errors for this read.
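The arithmetic above can be checked directly (same assumed read: 50 bases at Q5 and 50 bases treated as error-free):

```python
p_q5 = 10 ** (-5 / 10)   # error probability of a Q5 base, ~0.3162
expected = 50 * p_q5     # expected errors for the whole read, ~15.8
```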

So for this read length, using --max-expected-errors with a threshold of 15.8 or lower guarantees that reads exceeding your 50% low-quality base threshold are filtered out, but it also filters out more: for example, a read with 50 quality scores of 5 and one quality score of 6 (and 49 infinite qualities) will be discarded by --max-expected-errors 15.8, but not by your criterion.
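That edge case can be verified the same way (hypothetical read: 50 bases at Q5, one at Q6, 49 error-free):

```python
p_q5 = 10 ** (-5 / 10)        # ~0.3162
p_q6 = 10 ** (-6 / 10)        # ~0.2512
expected = 50 * p_q5 + p_q6   # ~16.06, just above the 15.8 threshold
low_q_fraction = 50 / 100     # exactly 50% of bases have Q <= 5, so a
                              # "more than 50%" ratio filter keeps the read
```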

That said, a threshold of 15.8 appears to be very large anyway. https://www.drive5.com/usearch/manual/faq_emax.html recommends using a threshold of 1.