Open MiguelHK opened 4 months ago
I'd recommend fastp, which supports this. See https://github.com/OpenGene/fastp?tab=readme-ov-file#quality-filter ; you could try -q 30 -u 20 (-q sets the per-base quality threshold, -u the maximum percentage of bases allowed below it).
BTW, for a read with 50 bp at score 20 and 50 bp at score 40, the average quality score is not 30.
I am aware that fastp can do this; however, I use seqkit for several steps, and it would be great if seqkit offered this feature too.
About the average quality score, you are correct, the average score is ~23. Thanks for pointing it out!
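To make the ~23 figure concrete, here is a minimal sketch of the arithmetic. It assumes the average is taken over the per-base error probabilities and then converted back to a Phred score, which is consistent with the value quoted above; the function name is made up for illustration and is not part of seqkit.

```python
import math

def mean_phred_via_error_prob(quals):
    """Average Phred scores by averaging the underlying error probabilities."""
    # Phred Q corresponds to an error probability p = 10^(-Q/10)
    probs = [10 ** (-q / 10) for q in quals]
    mean_p = sum(probs) / len(probs)
    # Convert the mean error probability back to a Phred score
    return -10 * math.log10(mean_p)

# The read from the comment above: 50 bases at Q20 and 50 bases at Q40
quals = [20] * 50 + [40] * 50

arithmetic = sum(quals) / len(quals)           # plain mean of the scores
prob_based = mean_phred_via_error_prob(quals)  # mean taken over error probabilities

print(arithmetic)            # 30.0
print(round(prob_based, 2))  # 22.97
```

The low-quality half dominates because a Q20 base is 100 times more likely to be wrong than a Q40 base.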
Currently, seqkit seq filters by quality based on the average quality score of a read. However, other tools, such as FASTX's fastq_quality_filter, let the user specify what percentage of nucleotides must have a minimum Phred score of X. Example:
We have 5 sequences that are 100 nucleotides long:
With an average Phred score of 30, these sequences might be acceptable using seqkit seq --min-qual 30, but if we want to limit the fraction of low-quality nucleotides (say, at most 20% of nucleotides below a Phred score of 30), all of these sequences would be discarded. This is currently not possible with seqkit, but it is possible (albeit slower) with other tools.
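The requested behaviour could be sketched roughly as follows. This is only an illustration of the per-base percentage check, not how seqkit, fastp, or FASTX implement it; the function names and parameters are hypothetical, and Phred+33 quality encoding is assumed.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_line]

def passes_percent_filter(quality_line, min_qual=30, max_low_percent=20.0, offset=33):
    """Keep a read only if at most max_low_percent of its bases fall below min_qual."""
    scores = phred_scores(quality_line, offset)
    if not scores:
        return False  # treat empty reads as failing (an arbitrary choice for this sketch)
    low = sum(1 for q in scores if q < min_qual)
    return 100.0 * low / len(scores) <= max_low_percent

# '5' encodes Q20 and 'I' encodes Q40 under Phred+33
half_low = "5" * 50 + "I" * 50     # 50% of bases below Q30
mostly_high = "5" * 10 + "I" * 90  # 10% of bases below Q30

print(passes_percent_filter(half_low))     # False
print(passes_percent_filter(mostly_high))  # True
```

With these defaults, a read where half of the bases are at Q20 is rejected, while a read with only 10% of its bases below Q30 is kept.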
Now, knowing how flexible and fast seqkit is, I would love to see this feature included!