khyox / recentrifuge

Recentrifuge: robust comparative analysis and contamination removal for metagenomics
http://www.recentrifuge.org

Score understanding #44

Closed lvelosuarez closed 10 months ago

lvelosuarez commented 2 years ago

Hello, thanks for developing this software. I find it really useful in my work, clinical diagnosis with shotgun metagenomics. I am running rcf on kraken2 results from some of my RNA libraries. This is my command:

rcf -n /DATA/share/microbio/taxdump -k tneg_revelo.krk -k influenza_revelo.krk -k metapn_revelo.krk -c 1 -o revelo_mintaxa500.html  -s KRAKEN  -e CSV -y 50 -m 500 -d

So I have chosen KRAKEN as the score (% k-mer coverage), a min score of 50 (my understanding is that I am rejecting reads with less than 50% k-mer coverage), and min taxa of 500 (taxa with fewer than 500 reads will be folded). My question: in my stats results I see Score limit 50, but Score min 102, Score mean 227, and Score max 269. How come? I was expecting %-like scoring since I chose KRAKEN as the score, so I do not understand what these min, mean, and max scores mean in the stats file... Can you shed some light on this?

Again, thanks :)

khyox commented 2 years ago

Hello @lvelosuarez!

Many thanks for the introduction! I appreciate that you take some time/words to let me know about your work and your use case for Recentrifuge. :)

That's a good question! I have clarified the relevant section of the wiki by adding a new paragraph, so hopefully it answers your question now. In short, the value reported for "score limit" in the stats matches the value entered for the minscore parameter, whatever the scoring scheme selected (50 in your case). The score statistics (min, mean, max) are a different matter: they are always expressed as SHEL (Single Hit Equivalent Length). The rationale is that SHEL values make it easier to compare with the results obtained for the same sample from other taxonomic classifiers: in some difficult, and not so difficult, cases you want to see what different classifiers find in your samples and compare the uncertainty levels.
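To make the distinction concrete, here is a minimal toy sketch in Python. It is not Recentrifuge's code and does not reproduce its formulas: the `Read` class, the `score_stats` function, the inclusive `>=` threshold, and all the numbers are hypothetical, chosen only to show how a limit given on one scale (a percentage) can sit next to statistics reported on another (a length-like score such as SHEL).

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Read:
    """Toy read carrying two scores on different scales (both made up)."""
    coverage_pct: float  # %-like score, used here only for filtering
    shel: float          # length-like score, used here only for reporting


def score_stats(reads, min_score):
    """Filter on the %-like score, then summarize the length-like score.

    Assumption: the threshold is inclusive (>=). Raises ValueError if no
    read passes the filter, since min() of an empty list is undefined.
    """
    kept = [r.shel for r in reads if r.coverage_pct >= min_score]
    return {
        "limit": min_score,   # echoed back in the scale the user typed
        "min": min(kept),     # the remaining values are in the other scale
        "mean": mean(kept),
        "max": max(kept),
    }


# Hypothetical reads: the third one fails the 50% threshold.
reads = [Read(80.0, 250.0), Read(55.0, 110.0), Read(30.0, 60.0), Read(95.0, 270.0)]
stats = score_stats(reads, 50)
print(stats)
```

In this toy run the limit stays at 50 while min, mean, and max are all above 100, which is the same pattern as in your stats file: the limit is in the units you entered, and the summary is in the length-like units.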

Thanks for bringing my attention to this obscure point of the documentation. I hope this is clearer now and that your question is answered. :)