Closed (hdore closed this issue 2 years ago)
Thank you for the report. We will communicate these results to the ANI team.
Hello @hdore : there is a new release, 2022-02-10.build5872, which addresses the issue you reported. In this release, the ANI team introduced a minimum coverage requirement to name a 'Predicted organism'. If the query assembly covers less than 20% of all type material assemblies it was compared to, no predicted organism is returned. The bar was set at 20% because there are real contamination cases for which the best match is between 20 and 50%. In addition, I should have clarified earlier that the taxcheck is performed against type assemblies only, so similarity to assemblies that are not type will not be detected. You can find more information in this publication. Please let us know how this new release works for you!
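The minimum coverage rule described above can be sketched roughly as follows. This is not PGAP's code: the `TypeMatch` record, the `predicted_organism` function, and the choice to pick the highest-ANI match among those that pass the threshold are all assumptions made for illustration. Only the 20% threshold comes from the reply above.

```python
from typing import List, NamedTuple, Optional

# Threshold named in the reply above (percent of the query assembly covered).
MIN_QUERY_COVERAGE = 20.0


class TypeMatch(NamedTuple):
    """Hypothetical record for one query-vs-type-assembly comparison."""
    organism: str
    ani: float             # percent identity over aligned regions
    query_coverage: float  # percent of the query assembly that aligned


def predicted_organism(matches: List[TypeMatch]) -> Optional[str]:
    """Return an organism only if at least one match is sufficiently covered.

    Assumption: among matches that pass the coverage threshold, the one
    with the highest ANI is reported. The real selection rule may differ.
    """
    covered = [m for m in matches if m.query_coverage >= MIN_QUERY_COVERAGE]
    if not covered:
        return None  # no 'Predicted organism' is reported
    return max(covered, key=lambda m: m.ani).organism


# Made-up values, for illustration only (not taken from any taxcheck report).
low_coverage_only = [TypeMatch("Organism A", 90.0, 1.0),
                     TypeMatch("Organism B", 85.0, 0.5)]
print(predicted_organism(low_coverage_only))
```

With this rule, a high-ANI hit that covers only about 1% of the query no longer produces a predicted organism, which is the situation reported in this issue.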
Hello @thibaudnis, Awesome, thank you! Did the ANI team comment on the following remark I made above? "I would be expecting values above 10% at least for some of these comparisons, e.g. with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3). I guess the "coverage" value depends on the ANI cut-off at which the coverage is computed, so the values from taxcheck-only might be real if the cut-off is high?"
Thank you for clarifying about the type assemblies.
The main reason why I was running taxcheck-only was to determine which Genus to indicate in my yaml file. I see in your reply to issue #173 that the new release includes a --auto-correct-tax option. That's very helpful! How would that work if no 'Predicted organism' is in the output of taxcheck?
Thank you,
hdore
Hmm... I didn't get any comment back on this specific case. Let me circle back. If Predicted organism is "none" and --auto-correct-tax is set, the program stops after the taxcheck and before the annotation starts, with the message 'ERROR: taxcheck failed to assign a species with high confidence, thus PGAP will not execute. See
"I would be expecting values above 10% at least for some of these comparisons, e.g. with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3)"
Would you expect that with GCA_016745215.1 as well? Is your expectation based on gtdb-tk results? Our process consists of two steps. The first step is a k-mer analysis that builds a set of candidate type assemblies closest to the query assembly. The second step is a set of pairwise BLAST searches of the query vs. each candidate type assembly using a word size of 28. Marichromatium purpuratum 984 (GCA_000224005.3) passes the first step, but the query and subject coverage values returned by the BLAST search are very low. It is possible that a smaller word size would detect more regions of homology and increase the coverage between Thiohalocapsa and Marichromatium purpuratum up to the value of 10% you expect. Regardless, these two assemblies appear so distant that we can be confident that the query assembly is not of species Marichromatium purpuratum.
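To illustrate why the word size matters for coverage, here is a toy sketch. It is not BLAST and not the taxcheck pipeline: it only counts query positions that lie inside an exact word shared with the subject, which is the seeding idea in its simplest form. The function names and the synthetic sequences are assumptions made for the example.

```python
import random


def seeded_coverage(query: str, subject: str, word_size: int) -> float:
    """Fraction of query positions inside at least one exact shared word."""
    if len(query) < word_size:
        return 0.0
    subject_words = {subject[i:i + word_size]
                     for i in range(len(subject) - word_size + 1)}
    covered = [False] * len(query)
    for i in range(len(query) - word_size + 1):
        if query[i:i + word_size] in subject_words:
            for j in range(i, i + word_size):
                covered[j] = True
    return sum(covered) / len(query)


def mutate_every(seq: str, step: int) -> str:
    """Substitute one base every `step` positions (always a real change)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return "".join(swap[b] if i % step == 0 else b for i, b in enumerate(seq))


# Synthetic data: a random subject and a query that differs every 20 bases.
rng = random.Random(0)
subject = "".join(rng.choice("ACGT") for _ in range(2000))
query = mutate_every(subject, 20)

for w in (28, 11):
    print(w, round(seeded_coverage(query, subject, w), 3))
```

With a substitution every 20 bases, every 28-base window of the query contains a mismatch, so no exact 28-base word seeds a match, while 11-base words still cover most of the query. Real aligners extend seeds and score alignments, so actual coverage values will not follow this toy model exactly.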
I'm not sure what I used at that time but it was probably not a blast search with a word size of 28, so that is likely why the result was different.
I agree that we can be confident that it is not the same species (not even the same genus).
I initially wanted to run taxcheck-only to select a Genus that I could use in the yaml file, which can now be done automatically in the new release.
I did not realize that taxcheck was performed against type assemblies only and thus limited the set of possible genera/species.
I guess I should not use the --auto-correct-tax option if my genomes are too far from any type material assemblies.
Thank you for your help and clarifications on how taxcheck works!
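As a summary of the behaviour discussed in this thread, the interaction between the taxcheck result and --auto-correct-tax might be sketched like this. The function name and signature are hypothetical, and the assumption that a predicted organism replaces the submitted one when the flag is set is inferred from the option's name, not confirmed here. The error text is the part of the message quoted earlier in the thread, which is truncated in the source.

```python
from typing import Optional

# Portion of the error message quoted in the thread (the rest is truncated there).
TAXCHECK_FAILURE = ("ERROR: taxcheck failed to assign a species with high "
                    "confidence, thus PGAP will not execute.")


def resolve_organism(submitted: str,
                     predicted: Optional[str],
                     auto_correct_tax: bool) -> str:
    """Decide which organism name annotation would proceed with.

    Hypothetical sketch:
    - without --auto-correct-tax, the submitted name is kept;
    - with it, a missing prediction stops the run before annotation;
    - with it and a prediction, the predicted name is used (assumption).
    """
    if not auto_correct_tax:
        return submitted
    if predicted is None:
        raise RuntimeError(TAXCHECK_FAILURE)
    return predicted


# Without the flag, the genus given in the yaml file is kept as-is.
print(resolve_organism("Thiohalocapsa", None, auto_correct_tax=False))
```

Under this reading, a genome far from every type assembly is better run without --auto-correct-tax, since no organism can be predicted for it.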
Hello,
I'm using pgap with Singularity on an HPC. The image version is pgap_2021-07-01.build5508.sif.
I tried to use taxcheck-only on some MAGs (circular MAGs from polished assemblies of long reads), and it gave me unexpectedly low (Query coverage, Subject coverage) values (around 1% or below 1%), where I would expect higher values.
I had previously used gtdb-tk to get an idea of their lineage. In one example, gtdb-tk indicates 99.91% ANI and 1.0 aligned fraction (100%) to NCBI's assembly GCA_003525925.1, which I know to be the same as (or very similar to) GCA_016745215.1 (Thiohalocapsa sp. PB-PSB1). I ran pgap.py with Thiohalocapsa as "genus_species" and obtained the following result:
I would be expecting values above 10% at least for some of these comparisons, e.g. with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3). I guess the "coverage" value depends on the ANI cut-off at which the coverage is computed, so the values from taxcheck-only might be real if the cut-off is high?
In addition, even if the status of taxcheck is "inconclusive", it outputs a "Predicted organism" and seems to choose the one with the highest ANI even if the "coverage" value is very low, which I find surprising (here Massilia niastensis).
This result should be reproducible by using GCA_016745215.1 as input genome.
Thank you,
Hugo Doré