shenwei356 / seqkit

A cross-platform and ultrafast toolkit for FASTA/Q file manipulation
https://bioinf.shenwei.me/seqkit
MIT License

seqkit seq could filter by a % of nucleotides above a specific quality threshold (both user-defined) #472

Open MiguelHK opened 3 months ago

MiguelHK commented 3 months ago

Currently, seqkit seq filters by quality based on an average quality score. However, other tools, such as FASTX's fastq_quality_filter, let the user choose what percentage of nucleotides must have a minimum Phred score of X. Example:

We have 5 sequences that are 100 nucleotides long:

With an average Phred score of 30, these sequences might be acceptable using seqkit seq --min-qual 30. But if we want to make sure that only a small percentage of the nucleotides have low quality (let's say we allow at most 20% of nucleotides to be below a Phred score of 30), all of these sequences would be discarded. This is currently not possible with seqkit, but it is possible (albeit slower) with other tools.
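The percentage-based filter described above can be sketched in a few lines of Python. This is only an illustration of the requested behaviour, not seqkit code: the function names `fraction_below` and `passes` are hypothetical, Phred+33 encoding is assumed, and the example read (50 bases at Q20, 50 bases at Q40) is made up for demonstration.

```python
def fraction_below(qual: str, threshold: int, offset: int = 33) -> float:
    """Fraction of bases whose Phred score is below `threshold`.

    `qual` is a FASTQ quality string; Phred+33 encoding is assumed.
    """
    if not qual:
        return 0.0
    low = sum(1 for ch in qual if ord(ch) - offset < threshold)
    return low / len(qual)


def passes(qual: str, threshold: int = 30, max_low_fraction: float = 0.20) -> bool:
    """Keep a read only if at most `max_low_fraction` of its bases are below `threshold`."""
    return fraction_below(qual, threshold) <= max_low_fraction


# Hypothetical read: 50 bases at Q20 ('5') followed by 50 bases at Q40 ('I').
read = "5" * 50 + "I" * 50
print(fraction_below(read, 30))  # half of the bases are below Q30
print(passes(read))              # the read is rejected under the 20% limit
```

With these settings, a read is rejected as soon as more than 20% of its bases fall below Q30, regardless of how high the remaining bases score.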

Now, knowing how flexible and fast seqkit is, I would love to see this feature included!

shenwei356 commented 3 months ago

I'd recommend fastp, which supports this. See here: https://github.com/OpenGene/fastp?tab=readme-ov-file#quality-filter (maybe you can use -q 30 -u 20).

BTW, for a read with 50 bp at score 20 and 50 bp at score 40, the average quality score is not 30.

MiguelHK commented 3 months ago

I am aware that fastp is capable of doing this. However, I use seqkit for several steps, and it would be great if this were also a feature of seqkit.

About the average quality score, you are correct: the average score is ~23. Thanks for pointing it out!
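For reference, the ~23 figure can be reproduced with a short Python sketch. It assumes the average is taken over per-base error probabilities and then converted back to a Phred score, which is consistent with the numbers in this thread. The function name `mean_quality` is hypothetical and is not seqkit's implementation.

```python
import math


def mean_quality(quals):
    """Average Phred score computed via per-base error probabilities.

    Each score Q is converted to an error probability 10^(-Q/10), the
    probabilities are averaged, and the mean is converted back to Phred.
    """
    probs = [10 ** (-q / 10) for q in quals]
    return -10 * math.log10(sum(probs) / len(probs))


# The read from the discussion: 50 bases at Q20 and 50 bases at Q40.
quals = [20] * 50 + [40] * 50

print(sum(quals) / len(quals))        # arithmetic mean of the raw scores: 30.0
print(round(mean_quality(quals), 2))  # probability-based mean: 22.97
```

The low-quality half dominates because a Q20 base has an error probability 100 times higher than a Q40 base.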