Diamond parameters - max-target-seqs

hjruscheweyh commented 2 years ago

Dear Gunc Team

I have tried to use GUNC recently to quality control the MAGs that we're building in the lab. For that I also looked into the internal diamond command for which you state the -k 1 will be retaining best hits. However setting -k 1 will search for the best hits but report only one. Setting --top 0 on the other hand will report all best hits. I'm not sure about the internals of GUNC but this could potentially inflate false positive detection of chimeras?

Example:

diamond .... --max-target-seqs 1 --> will find the best hit but ignore 2 other hits that score equally well

GENE    GB_GCA_001915545.1  100.0   221 0   0   1   221 1   221 1.4e-122    447.6

diamond .... --top 0 --> will find multiple best hits

GENE    GB_GCA_001915545.1  100.0   221 0   0   1   221 1   221 1.4e-122    447.6
GENE    GB_GCA_900555085.1  100.0   221 0   0   1   221 1   221 1.4e-122    447.6
GENE    RS_GCF_000012825.1  100.0   221 0   0   1   221 1   221 1.4e-122    447.6

defleury commented 2 years ago

Hi Hans!

Thanks for bringing this up, this is a good point. We had played with these parameters during prototyping/development and eventually decided to use this heuristic because it was faster, more memory efficient and did not affect results noticeably. One way that such issues can come up is if incomplete/partial genes are called; we are looking into that for other reasons as well for future versions.

That said, it is of course possible in theory that accuracy suffers when we arbitrarily pick one hit among several equally good ones. Yet the overall method overall is relatively robust to that:

If the single top hit for a gene aligns with the consensus taxonomy for the other genes on the same contig, this should not actually be considered an error imo.
If the top hit differs from the consensus among other genes on the same contig, it is expected to do so in a non-deterministic way. GUNC has several 'noise' filters against lowly prevalent classifications that we found take care of such cases without a relevant penalty on sensitivity.
If a gene with multiple alternative best hits is in a larger contig (of even as few as >5 genes), those will 'outvote' the conflicting label.

So for the problem to really affect chimerism calls, we would need to see multiple genes for which the top hit was picked wrongly, but in a consistent way – e.g., if three equally good hits are to taxa A,B & C, the top hit would have to be to the same clade (e.g., A) disproportionally in order to affect the result. Also, the current behaviour would usually not lead to a false positive chimerism call, but more likely to a false negative: if too much 'noise' is introduced by arbitrarily picked alternative top hits, the confidence in the overall scoring is lowered.

Long story short, we benchmarked this during early dev stages, but we will have to either dig up those files, or (more likely) rerun some of these tests to fully address your point. Either @Askarbek-orakov or I will follow up here.

hjruscheweyh commented 2 years ago

Hi!

thank you for the quick answer. You're probably right that this issue will even out over the length of a genome. I also think that reducing the max-target-seqs will have an impact on the runtime for small query sizes. For large number of genomes you will probably have more impact when using the -c and -b parameters. I have updated the code on our local installation and have set it to -c1 and -b12 to maximize the memory usage and reducing runtime.

Best, Hans

fullama commented 2 years ago

Yeah we did consider having non-default values for -b and -c but thought that most users would probably run on single genome or may not have access to a large-ish cluster.. but if its useful maybe we could add it as optional arguments in the next release

hjruscheweyh commented 2 years ago

Hi!

you're probably right. I tried to manually update the parameters and then started gunc with 25k genomes and the gtdb database. Gene calling and postprocessing took quite a while and was not using the resources too efficiently. diamond then required like 15 hours with 64 cores, 1TB of RAM and 4TB of disk space.

I will go back to the default parameters and run only small batches of genomes using more cores.

Best, Hans

grp-bork / gunc

Diamond parameters - max-target-seqs #20