pirovc / ganon

ganon2 efficiently classifies genomic sequences against large sets of references, with integrated download and update of databases (RefSeq/GenBank), taxonomic profiling (NCBI/GTDB), binning and hierarchical classification, customized reporting, and more
https://pirovc.github.io/ganon/
MIT License

ganon classify: --rel-cutoff and --rel-filter guidelines? #196

Closed · rjsorr closed this issue 2 years ago

rjsorr commented 2 years ago

Hi, I'm using a --rel-cutoff (-c) and --rel-filter (-e) value of 0.25 in classify (window size of 32 from build). I'm getting approx. 50% of reads classified, and approx. 20% of those are unique matches. I would in theory like to bump up both percentages, if that's possible. I'm using 130-150 bp paired-end reads.

I'm wondering if you have any guidelines for using these parameters beyond what is written on the main page, or even suggested values? I'm guessing that lowering --rel-cutoff (-c) will recruit more reads, but at the cost of unique matches. So what is a good trade-off?

pirovc commented 2 years ago

Hi @rjsorr,

In general terms, the lower the --rel-cutoff, the more reads will get classified. The higher the --rel-filter, the fewer reads will end up as unique matches, since more matches are going to pass the filter.
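
To illustrate, here is a toy Python sketch of the semantics described above. This is not ganon's actual implementation; the exact counting, rounding, and tie-breaking may differ, and the function and variable names are made up for illustration:

```python
# Toy model of the --rel-cutoff / --rel-filter interplay described in this
# thread (NOT ganon's real code). Assumes a read was decomposed into
# n_minimizers minimizers and each candidate reference shares some of them.

def classify_matches(shared_counts, n_minimizers, rel_cutoff, rel_filter):
    """Return the reference matches that survive both filters.

    shared_counts: dict mapping reference -> number of shared minimizers
    rel_cutoff:    a match must share at least rel_cutoff * n_minimizers
    rel_filter:    of the surviving matches, keep those within rel_filter
                   of the best one (0 = only ties with the best match,
                   1 = keep everything that passed the cutoff)
    """
    # --rel-cutoff: discard weak matches relative to the read itself
    cutoff = rel_cutoff * n_minimizers
    candidates = {ref: s for ref, s in shared_counts.items() if s >= cutoff}
    if not candidates:
        return {}  # read stays unclassified

    # --rel-filter: discard matches far from the best surviving match
    best = max(candidates.values())
    threshold = best * (1 - rel_filter)
    return {ref: s for ref, s in candidates.items() if s >= threshold}

# Example: 100 minimizers per read, three candidate references
matches = {"refA": 80, "refB": 60, "refC": 20}
print(classify_matches(matches, 100, 0.25, 0.0))   # only refA -> unique match
print(classify_matches(matches, 100, 0.25, 0.25))  # refA and refB -> shared
print(classify_matches(matches, 100, 0.0, 1.0))    # all three refs kept
```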

A good configuration will depend on how well represented your reads are in the reference databases and how similar they are to the references. It may be possible to bump up your percentages, but that doesn't mean you are going to get better results, because you are increasing the chance of false positives. Think of those parameters as a way to reduce false positives: the good matches will always be reported with the values you're currently using.

If you set --rel-cutoff 0 --rel-filter 1, you will get every match between a read and a reference sharing at least 1 k-mer (1 minimizer in your case). From there you can start increasing --rel-cutoff and decreasing --rel-filter to see where the best fit is for your data, as in the sketch below. I'd try --rel-cutoff 0.15 --rel-filter 0.1, but that's just a guess. Hope that helps.
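
A hypothetical sweep over both parameters could look like the following. Only --rel-cutoff and --rel-filter come from this thread; the other flag names (--db-prefix, --paired-reads, --output-prefix) and file names are assumptions and should be checked against `ganon classify -h` for your version:

```python
# Hypothetical parameter sweep: start permissive (--rel-cutoff 0,
# --rel-filter 1) and tighten from there, comparing the reported
# percentages of classified reads and unique matches for each run.
import itertools
import subprocess

cutoffs = [0.0, 0.1, 0.15, 0.25]
filters = [1.0, 0.25, 0.1, 0.0]

for c, e in itertools.product(cutoffs, filters):
    subprocess.run(
        [
            "ganon", "classify",
            "--db-prefix", "my_db",                     # assumed database prefix
            "--paired-reads", "r1.fq.gz", "r2.fq.gz",   # assumed input files
            "--rel-cutoff", str(c),
            "--rel-filter", str(e),
            "--output-prefix", f"run_c{c}_e{e}",        # one output per setting
        ],
        check=True,
    )
```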

rjsorr commented 2 years ago

Cheers for the quick reply @pirovc! I completely agree regarding false positives; that is of course the main consideration. There is no point increasing the number of classified reads if they are not correctly classified. That said, it is difficult for me to test this on my dataset, at least not without a lot of extra work, so I'm going to trust your suggestion/experience here. The reasoning makes sense to me :) regards

rjsorr commented 2 years ago

FYI @pirovc, it's difficult for me to say anything about false positives without a lot of extra testing. But after sweeping -c and -e, the best -c for me was 0.1, while a -e of 0 gave a much higher classification rate at the superkingdom level as well as more unique matches. Increasing -c to 0.5 gave 27 percentage points fewer unique matches (from 52% to 25%) and 19 percentage points fewer reads classified to superkingdoms (from 63% to 44%). So quite a difference!