fancyge closed this issue 3 years ago
Thanks a lot! I am opening this new issue since it's a separate question.
I am thinking of setting a score threshold for a certain reliability, and I'm looking at the parameter -y/--minscore in rextract. At first I thought it filters assignments based on the scores shown in the 4th column (with header "score") of the Centrifuge result. However, it seems not. Can you please tell me what -y/--minscore is? How does quality control work in rextract? I would also like to ask whether you know about this 4th column (with header "score") in the Centrifuge result. Should I first filter out the bad assignments with low scores from the Centrifuge result, or can I just use rextract in one go? Thank you.
Thanks for opening a new issue for this.
The nature of the downstream analysis of Recentrifuge tools requires that the taxonomic classifier provides at least one score per read. In the case of Centrifuge, both recentrifuge and rextract use the score column that you mention. The default scoring option for Centrifuge converts that score to SHEL (Single Hit Equivalent Length), which is an intuitive scoring system used by the classifier LMAT. The wiki has more details, along with the other scoring schemes available in Recentrifuge for parsing Centrifuge outputs.
At this point, you probably already know the answer to your last question: you can use rextract in one go, using the minscore option to filter out the assignments below the indicated SHEL.
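To make the relation between the Centrifuge score column and the minscore threshold concrete, here is a minimal sketch, not the actual Recentrifuge code. It assumes that Centrifuge scores a single hit as (hit length - 15)^2, so the equivalent single-hit length is the square root of the score plus 15. The function names and the constant are hypothetical and only serve the illustration.

```python
import math
from typing import Optional

# Assumption: Centrifuge scores one hit as (hit_length - 15) ** 2,
# so 15 is the offset to add back after taking the square root.
CENTRIFUGE_HIT_OFFSET = 15


def shel_from_centrifuge_score(score: float) -> float:
    """Convert a Centrifuge score to an approximate SHEL value (hypothetical helper)."""
    return math.sqrt(score) + CENTRIFUGE_HIT_OFFSET


def passes_minscore(score: float, minscore: Optional[float] = None) -> bool:
    """Return True if the read is kept; reads with SHEL below minscore are dropped."""
    if minscore is None:
        return True  # no threshold given, keep every read
    return shel_from_centrifuge_score(score) >= minscore


if __name__ == "__main__":
    for raw_score in (100, 400, 2500):
        shel = shel_from_centrifuge_score(raw_score)
        print(raw_score, shel, passes_minscore(raw_score, minscore=35))
```

Under that assumption, a threshold of 35 corresponds to a raw Centrifuge score of 400, which is why the threshold is easier to reason about in SHEL units than in raw scores.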
Thanks! That's quite helpful information! Now I see how they are related.
Hi, I have another question about the score. Is there a way to exclude the reads of a taxID that pass a score threshold, while keeping the reads of that taxID that fall below it? Thanks.
Do you mean to invert the filter so that it filters scores over a threshold instead of under it?
Yes!
That feature is not implemented, but I see that it could be interesting for some particular situations. Anyway, if you want a quick solution, you can get it by inverting the inequality in line 120 of the file centrifuge.py in Recentrifuge, so it would be:
if minscore is not None and shel > minscore:
That would discard the reads with a score over the given SHEL value.
Thanks a lot. I tried your recommendation, but it didn't affect my test results. Does rextract call centrifuge.py?
No, that was for rcf, the main script of Recentrifuge. For rextract, do the same but with its line 230, so it would become:
if args.minscore is not None and score > args.minscore:
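As a standalone illustration of what flipping the inequality does, the following sketch contrasts the normal and the inverted filter on a small list of reads. It is not taken from rextract: the `filter_reads` function, the `invert` parameter and the `(read ID, score)` tuples are hypothetical, and the scores are made-up values for the example.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical read representation: (read ID, score already expressed as SHEL)
Read = Tuple[str, float]


def filter_reads(reads: Iterable[Read],
                 minscore: Optional[float] = None,
                 invert: bool = False) -> List[Read]:
    """Keep reads at or above minscore, or at or below it when invert is True."""
    if minscore is None:
        return list(reads)  # no threshold, nothing is discarded
    if invert:
        # Mirrors "score > minscore" as the discard condition
        return [read for read in reads if read[1] <= minscore]
    # Mirrors "score < minscore" as the discard condition
    return [read for read in reads if read[1] >= minscore]


if __name__ == "__main__":
    example = [("r1", 20.0), ("r2", 35.0), ("r3", 50.0)]  # made-up scores
    print(filter_reads(example, minscore=35))
    print(filter_reads(example, minscore=35, invert=True))
```

Note that with a plain flip of the strict inequality, a read whose score equals the threshold is kept in both modes.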
Thank you very much!
My recommendation is to avoid minscore (-y flag) values that are too low, so that sequences with low scores are filtered out. Also, if you have control sequences, you may want to lower ctrlminscore (-z flag) to have more sequences in the controls and thus more sequences removed by the robust control removal algorithm. So, --minscore 35 and --ctrlminscore 25 could be good values to start with.
Originally posted by @khyox in https://github.com/khyox/recentrifuge/issues/30#issuecomment-801449476