ncbi / pgap

NCBI Prokaryotic Genome Annotation Pipeline
Other
310 stars 90 forks source link

Unexpected results with --taxcheck-only #168

Closed hdore closed 2 years ago

hdore commented 3 years ago

Hello,

I'm using pgap with Singularity on a HPC. The image version is pgap_2021-07-01.build5508.sif .

I tried to use taxcheck-only on some MAGs (circular MAGs from polished assemblies of long reads), and it gave me unexpectedly low (Query coverage, Subject coverage) values (around 1% or below 1%), where I would expect higher values.

I had previously used gtdb-tk to have an idea of their lineage. In one example, gtdb-tk indicates 99.91% ANI and 1.0 aligned fraction (100%) to NCBI' s assembly GCA_003525925.1, which I know to be the same (or very similar to) GCA_016745215.1 (Thiohalocapsa sp. PB-PSB1). I ran pgap.py with Thiohalocapsa as "genus_species" and obtained the following result:

ANI report for assembly: my_gc_assm_name
Submitted organism: Thiohalocapsa (taxid = 85079, rank = genus, lineage = Bacteria; Proteobacteria; Gammaproteobacteria; Chromatiales; Chromatiaceae)
Predicted organism: Massilia niastensis (taxid = 544911, rank = species, lineage = Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales; Oxalobacteraceae; Massilia)
Submitted organism has type: No
Status: INCONCLUSIVE
Confidence: LOW
81.504 ( 1.1  2.0) 14002248 assembly    46865 Thiohalocapsa marina (GCA_008632335.1, ASM863233v1)
80.728 (  .6   .6)  5715848 assembly    18158 Candidatus Thiodictyon syntrophicum (GCA_002813775.1, ASM281377v1)
80.749 (  .4   .4)   602798 assembly     3443 Lamprocystis purpurea DSM 4197 (GCA_000379525.1, ASM37952v1)
81.481 (  .3   .6)   890448 assembly        0 Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3)
83.035 (  .2   .5)  8727128 assembly     4795 Marichromatium gracile (GCA_004343155.1, ASM434315v1)
83.146 (  .1   .2)  3368108 assembly        0 Thauera phenolivorans (GCA_001696715.1, ASM169671v1)
81.163 (  .1   .1)  6500588 assembly        0 Azospirillum thermophilum (GCA_003130795.1, ASM313079v1)
81.737 (  .1   .1)  8237148 assembly        0 Rubrivivax albus (GCA_004016515.1, ASM401651v1)
82.927 (  .1   .1) 10019778 assembly        0 Aquabacterium pictum (GCA_005403045.1, ASM540304v1)
83.735 (  .1   .1)  1177498 assembly        0 Burkholderia paludis (GCA_000732615.1, ASM73261v1)
83.792 (  .0   .1) 15456508 assembly        0 Duganella albus (GCA_009674495.1, ASM967449v1)
78.484 (  .1   .0)  7075268 assembly        0 Kutzneria buriramensis (GCA_003387475.1, ASM338747v1)
78.206 (  .1   .0)  6819288 assembly        0 Amycolatopsis albispora (GCA_003312875.1, ASM331287v1)
78.413 (  .1   .0)   295128 assembly        0 Amycolatopsis mediterranei S699 (GCA_000220945.1, ASM22094v1)
77.975 (  .0   .0)  3706568 assembly        0 Actinoplanes philippinensis (GCA_900113015.1, IMG-taxon 2675903218 annotated assembly)
78.543 (  .0   .0)  8072838 assembly        0 Amycolatopsis eburnea (GCA_003937945.1, ASM393794v1)
79.424 (  .0   .0)  4725438 assembly        0 Streptomyces glauciniger (GCA_900188405.1, IMG-taxon 2671180020 annotated assembly)
78.509 (  .0   .0) 15540238 assembly        0 Catellatospora paridis (GCA_009720365.1, ASM972036v1)
79.142 (  .0   .0)  2788388 assembly        0 Streptomyces antibioticus (GCA_001514065.1, ASM151406v1)
87.199 (  .0   .0)   607618 assembly        0 Massilia niastensis DSM 21313 (GCA_000382345.1, ASM38234v1)

I would be expecting values above 10% at least for some of these comparisons, e.g with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3). I guess the "coverage" value depends at what ANI cut-off the coverage is computed, so the values from taxcheck-only might be real if the cut-off is high?

In addition, even if the status of taxcheck is "inconclusive", it outputs a "Predicted organism" and seems to choose the one with highest ANI even if the "coverage" value is very low, which I find surprising (here Massilia niastensis).

This result should be reproducible by using GCA_016745215.1 as input genome.

Thank you,

Hugo Doré

thibaudnis commented 3 years ago

Thank you for the report. We will communicate these results to the ANI team.

thibaudnis commented 2 years ago

Hello @hdore : there is a new release, 2022-02-10.build5872, which addresses the issue you reported. In this release, the ANI team introduced a minimum coverage requirement to name a 'Predicted organism'. If the query assembly covers less than 20% of all type material assemblies it was compared to, no predicted organism is returned. The bar was set at 20% because there are real contamination cases for which the best match is between 20 and 50%. In addition, I should have clarified earlier that the taxcheck is performed against type assemblies only, so similarity to assemblies that are not type will not be detected. You can find more information in this publication. Please let us know how this new release works for you!

hdore commented 2 years ago

Hello @thibaudnis , Awesome, thank you! Did the ANI team comment on the following remark I made above? "I would be expecting values above 10% at least for some of these comparisons, e.g with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3). I guess the "coverage" value depends at what ANI cut-off the coverage is computed, so the values from taxcheck-only might be real if the cut-off is high?"

Thank you for clarifying about the type assemblies.

The main reason why I was running taxcheck-only was to determine which Genus to indicate in my yaml file. I see in your reply to issue #173 that the new release includes a --auto-correct-tax option. That's very helpful! How would that work if no 'Predicted organism' is in the output of taxcheck?

Thank you,

hdore

thibaudnis commented 2 years ago

Hmm... I didn't get any comment back on this specific case. Let me circle back. If Predicted organism is "none" and --auto-correct-tax is set, the program stops after the taxcheck and before the annotation starts, with the message 'ERROR: taxcheck failed to assign a species with high confidence, thus PGAP will not execute. See /ani-tax-report.txt'.

thibaudnis commented 2 years ago

I would be expecting values above 10% at least for some of these comparisons, e.g with Marichromatium purpuratum 984 (GCA_000224005.3, ASM22400v3)

Would you expect that with GCA_016745215.1 as well? Is you expectation based on gtdb-tk results? Our process consitutes of two steps. The first step is a k-mer analysis for building a set of candidate type assemblies that the query assembly is closest to. The second step is a set of pairwise Blast searches of the query vs. each candidate type assembly using a word size of 28. Marichromatium purpuratum 984 (GCA_000224005.3) passes the first step, but the query and subject coverage values returned by the Blast search are very low. It is possible that a lower word size would detect more regions of homology and increase the coverage between Thiohalocapsa and Marichromatium purpuratum up to the value of 10% you expect. Regardless, these two assemblies appear so distant that we can be confident that the query assembly is not of species Marichromatium purpuratum.

hdore commented 2 years ago

I'm not sure what I used at that time but it was probably not a blast search with a word size of 28, so that is likely why the result was different. I agree that we can be confident that it is not the same species (not event the same genus). I initially wanted to run taxcheck-only to select a Genus that I could use in the yaml file, which can now be done automatically in the new relase. I did not realize that taxcheck was performed against type assemblies only and thus limited the set of possible genera/species. I guess I should not use the --auto-correct-tax option if my genomes are too far from any type material assemblies.

Thank you for your help and clarifications on how taxcheck works!