khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

parameter -y/--minscore in rextract #31

Closed: fancyge closed this issue 3 years ago

fancyge commented 3 years ago

By the way, do you have an idea of what score (-y) I should pass to rextract for filtering Centrifuge results? I have 150 bp FASTQ files from Illumina sequencing.

My recommendation is to avoid minscore values (-y flag) that are too low, so that sequences with low scores are filtered out. Also, if you have control sequences, you may want to lower ctrlminscore (-z flag) so that more sequences are kept in the controls and thus more sequences are removed by the robust contamination removal algorithm. So, --minscore 35 and --ctrlminscore 25 could be good values to start with.

Originally posted by @khyox in https://github.com/khyox/recentrifuge/issues/30#issuecomment-801449476

fancyge commented 3 years ago

Thanks a lot! I opened this new issue since it's a separate question.

I am thinking of setting a score threshold to get a certain level of reliability, and I'm looking at the -y/--minscore parameter in rextract. At first I thought it filtered assignments based on the scores shown in the 4th column (with header "score") of the Centrifuge result. However, that does not seem to be the case. Can you please tell me what -y/--minscore is? How does the quality control work in rextract? I would also like to ask what you know about this 4th column (with header "score") in the Centrifuge result. Should I first filter the bad assignments with low scores from the Centrifuge result, or can I just use rextract in one go? Thank you.

khyox commented 3 years ago

Thanks for opening a new issue for this.

The nature of the downstream analysis in the Recentrifuge tools requires that the taxonomic classifier provide at least one score per read. In the case of Centrifuge, both recentrifuge and rextract use the score column that you mention. The default scoring option for Centrifuge converts that score to SHEL (Single Hit Equivalent Length), which is an intuitive scoring system used by the classifier LMAT. You can find more details, along with the other scoring schemes available in Recentrifuge for parsing Centrifuge outputs, in the wiki.

At this point, you probably already know the answer to your last question: you can use rextract in one go using the minscore option to filter the assignments below the indicated SHEL.
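To make the relation between the Centrifuge score column and a minscore threshold more concrete, here is a minimal sketch in Python. It is not the Recentrifuge code: the names `shel_from_score` and `filter_by_shel` are made up for illustration, and the conversion `SHEL = sqrt(score) + 15` is an assumption based on Centrifuge scoring each hit as the square of (hit length - 15). Please check the wiki for the exact definition used by your version.

```python
import math
from typing import Iterable, Iterator, Optional


def shel_from_score(score: float) -> float:
    """Convert a Centrifuge score to SHEL.

    ASSUMPTION: SHEL = sqrt(score) + 15; verify against the wiki.
    """
    return math.sqrt(score) + 15


def filter_by_shel(lines: Iterable[str],
                   minscore: Optional[float] = None) -> Iterator[str]:
    """Yield Centrifuge output lines whose SHEL is not below minscore.

    Hypothetical helper: expects tab-separated lines with the score
    in the 4th column (readID, seqID, taxID, score, ...).
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or fields[0] == "readID":
            continue  # Skip the header and malformed lines
        try:
            shel = shel_from_score(float(fields[3]))
        except ValueError:
            continue  # Skip lines with a non-numeric or negative score
        if minscore is not None and shel < minscore:
            continue  # Discard the read: SHEL below the threshold
        yield line


example = [
    "readID\tseqID\ttaxID\tscore\t2ndBestScore\thitLength\tqueryLength\tnumMatches",
    "r1\ts1\t562\t400\t0\t35\t150\t1",  # SHEL = 35 under the assumption
    "r2\ts2\t562\t100\t0\t25\t150\t1",  # SHEL = 25 under the assumption
]
kept = list(filter_by_shel(example, minscore=35))
```

With these example lines and `minscore=35`, only `r1` would be kept under the assumed conversion, since a read with SHEL equal to the threshold is not discarded.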

fancyge commented 3 years ago

Thanks! That's quite helpful information! Now I see how they are related.

fancyge commented 3 years ago

Hi, I have another question about the score. Is there a way to exclude the reads of a taxID that score above a threshold, while keeping the reads of that taxID that score below it? Thanks.

khyox commented 3 years ago

Do you mean inverting the filter so that it filters out scores over a threshold instead of under it?

fancyge commented 3 years ago

Yes!

khyox commented 3 years ago

That feature is not implemented, but I can see that it could be useful in some particular situations. Anyway, if you want a quick solution, you can get it by inverting the inequality on line 120 of the file centrifuge.py in Recentrifuge, so that it becomes:

                if minscore is not None and shel > minscore:

That would discard the reads whose score is over the given SHEL threshold.
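If editing the source by hand feels fragile, the same idea can be sketched as a small standalone predicate. This is only an illustration: `discard_read` and its `invert` parameter are hypothetical names, not an existing Recentrifuge function or option.

```python
from typing import Optional


def discard_read(score: float,
                 minscore: Optional[float],
                 invert: bool = False) -> bool:
    """Decide whether a read should be discarded.

    Hypothetical helper: `invert` is NOT a Recentrifuge option.
    With invert=False, reads scoring under minscore are discarded;
    with invert=True, reads scoring over minscore are discarded.
    """
    if minscore is None:
        return False  # No threshold given, so nothing is discarded
    if invert:
        return score > minscore
    return score < minscore


# A read with score 40 against a threshold of 35
normal = discard_read(40, 35)                 # kept by the usual filter
inverted = discard_read(40, 35, invert=True)  # discarded by the inverted one
```

Note that in both modes a read scoring exactly the threshold is kept, which mirrors the strict inequalities in the one-line changes.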

fancyge commented 3 years ago

Thanks a lot. I tried your recommendation, but it didn't affect my test results. Does rextract call centrifuge.py in order to work?

khyox commented 3 years ago

No, that was for rcf, the main script of Recentrifuge. For rextract, do the same on its line 230, so that it becomes:

                if args.minscore is not None and score > args.minscore:

fancyge commented 3 years ago

Thank you very much!