Closed hjruscheweyh closed 2 years ago
Hi Hans!
Thanks for bringing this up, this is a good point. We had played with these parameters during prototyping/development and eventually decided to use this heuristic because it was faster, more memory efficient and did not affect results noticeably. One way that such issues can come up is if incomplete/partial genes are called; we are looking into that for other reasons as well for future versions.
That said, it is of course possible in theory that accuracy suffers when we arbitrarily pick one hit among several equally good ones. Yet the overall method overall is relatively robust to that:
So for the problem to really affect chimerism calls, we would need to see multiple genes for which the top hit was picked wrongly, but in a consistent way – e.g., if three equally good hits are to taxa A,B & C, the top hit would have to be to the same clade (e.g., A) disproportionally in order to affect the result. Also, the current behaviour would usually not lead to a false positive chimerism call, but more likely to a false negative: if too much 'noise' is introduced by arbitrarily picked alternative top hits, the confidence in the overall scoring is lowered.
Long story short, we benchmarked this during early dev stages, but we will have to either dig up those files, or (more likely) rerun some of these tests to fully address your point. Either @Askarbek-orakov or I will follow up here.
Hi!
thank you for the quick answer. You're probably right that this issue will even out over the length of a genome.
I also think that reducing the max-target-seqs will have an impact on the runtime for small query sizes. For large number of genomes you will probably have more impact when using the -c
and -b
parameters. I have updated the code on our local installation and have set it to -c1
and -b12
to maximize the memory usage and reducing runtime.
Best, Hans
Yeah we did consider having non-default values for -b
and -c
but thought that most users would probably run on single genome or may not have access to a large-ish cluster.. but if its useful maybe we could add it as optional arguments in the next release
Hi!
you're probably right. I tried to manually update the parameters and then started gunc with 25k genomes and the gtdb database. Gene calling and postprocessing took quite a while and was not using the resources too efficiently. diamond then required like 15 hours with 64 cores, 1TB of RAM and 4TB of disk space.
I will go back to the default parameters and run only small batches of genomes using more cores.
Best, Hans
Dear Gunc Team
I have tried to use GUNC recently to quality control the MAGs that we're building in the lab. For that I also looked into the internal diamond command for which you state the
-k 1
will beretaining best hits
. However setting-k 1
will search for the best hits but report only one. Setting--top 0
on the other hand will report all best hits. I'm not sure about the internals of GUNC but this could potentially inflate false positive detection of chimeras?Example:
diamond .... --max-target-seqs 1
--> will find the best hit but ignore 2 other hits that score equally welldiamond .... --top 0
--> will find multiple best hits