Missing ranks for GTDB genomes with duplicate names at different ranks

donovan-parks commented 4 months ago

I'm using v0.15.1. It appears the generated nodes.dmp is missing rank information for genomes with repeated GTDB taxon names at different ranks, e.g.:

GCA_009780445.1 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445

For this genome, the nodes.dmp file contains a species entry and a family entry, but no genus entry. This doesn't seem correct, since this genome (and species) does have a named genus in the GTDB taxonomy.
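
One way to confirm which ranks are present is to walk from a leaf TaxId to the root of nodes.dmp and collect the rank at each step. The following is a minimal Python sketch, not part of TaxonKit. It assumes the standard NCBI-style nodes.dmp layout (pipe-delimited fields, with TaxId, parent TaxId, and rank first) and a root node that is its own parent. The helper names and the toy TaxIds are hypothetical and chosen only to mirror the situation described above.

```python
# Toy nodes.dmp content (hypothetical TaxIds): root -> family -> species -> leaf,
# with no genus node in between.
TOY_NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tfamily\t|\n"
    "20\t|\t10\t|\tspecies\t|\n"
    "30\t|\t20\t|\tno rank\t|\n"
)


def parse_nodes(text):
    """Map each TaxId to (parent TaxId, rank) from nodes.dmp-style text."""
    nodes = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        nodes[int(fields[0])] = (int(fields[1]), fields[2])
    return nodes


def ranks_to_root(nodes, taxid):
    """Return the ranks seen while walking from taxid up to the root."""
    ranks = []
    while True:
        parent, rank = nodes[taxid]
        ranks.append(rank)
        if parent == taxid:  # the root is its own parent
            break
        taxid = parent
    return ranks


if __name__ == "__main__":
    ranks = ranks_to_root(parse_nodes(TOY_NODES), 30)
    print(ranks)
    print("genus present:", "genus" in ranks)
```

Run against a real nodes.dmp (read with `open(...).read()`), a lineage without a "genus" entry would indicate the problem reported here.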

Cheers, Donovan

shenwei356 commented 4 months ago

Yes. For these cases where child and parent taxa share the same name, I assumed, based on some observations, that there is no classified genus (which is common in the NCBI taxonomy, especially in Viruses). E.g.,

$ taxonkit list --ids 1698208185 --data-dir . -nr
1698208185 [family] WRAA01
  788434200 [species] WRAA01 sp009780445
    1595698180 [no rank] 009780445
  1373682363 [species] WRAA01 sp009780015
    561963250 [no rank] 009780015

If there's a genus, there should be only one, WRAA01. If there is more than one genus, the genus name would not be the same as the parent (family).

Ah, I should have asked you before writing this command. I'm open to discussion now.

With the current implementation, the genus name can still be output by filling it in from the parent taxon:

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445

$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/ -F -p "" -s ""
GCA_009780445.1 1595698180      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445
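
The effect of the second command can be illustrated with a short Python sketch that fills each empty rank with the name of the nearest higher named rank. This is only an illustration of the idea, not TaxonKit's implementation, and `fill_missing_ranks` is a hypothetical helper name.

```python
def fill_missing_ranks(lineage, sep=";"):
    """Fill empty ranks with the name of the nearest higher named rank."""
    filled = []
    last_named = ""
    for name in lineage.split(sep):
        if name:
            last_named = name
        # An empty slot inherits the most recent non-empty name above it.
        filled.append(name or last_named)
    return sep.join(filled)


if __name__ == "__main__":
    lineage = "Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445"
    print(fill_missing_ranks(lineage))
```

Note that this only patches the formatted output; the taxonomy itself still has no genus node, which is the underlying concern in this issue.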

donovan-parks commented 4 months ago

Hi.

Thank you for the quick response.

The GTDB taxonomy is always "complete". That is, every genome is assigned to all 7 ranks. The genome GCA_009780445.1 is assigned to the genus g__WRAA01 in the family f__WRAA01. In the GTDB taxonomy these are distinct labels as designated by the different rank suffix. As such, I would expect the nodes.dmp file to contain a genus and family entry, both with the name WRAA01.
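
To make the "complete" property concrete, here is a minimal Python sketch that splits a GTDB lineage string into its seven ranks and raises an error if any rank is missing. It assumes the `x__Name` label form separated by semicolons, as in the lineage quoted in this thread; `parse_gtdb_lineage` and `GTDB_RANKS` are hypothetical names used only for illustration.

```python
# Rank prefix -> rank name, in GTDB order.
GTDB_RANKS = {
    "d": "domain",
    "p": "phylum",
    "c": "class",
    "o": "order",
    "f": "family",
    "g": "genus",
    "s": "species",
}


def parse_gtdb_lineage(lineage):
    """Return {rank: name}; raise ValueError unless all 7 ranks are present."""
    taxa = {}
    for label in lineage.split(";"):
        prefix, sep, name = label.strip().partition("__")
        if not sep or prefix not in GTDB_RANKS:
            raise ValueError("unrecognised label: %r" % label)
        taxa[GTDB_RANKS[prefix]] = name
    missing = [rank for rank in GTDB_RANKS.values() if rank not in taxa]
    if missing:
        raise ValueError("missing ranks: %s" % ", ".join(missing))
    return taxa


if __name__ == "__main__":
    taxa = parse_gtdb_lineage(
        "d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;"
        "f__WRAA01;g__WRAA01;s__WRAA01 sp009780445"
    )
    # Same name at two ranks, kept apart by the rank key.
    print(taxa["family"], taxa["genus"])
```

Because the rank is part of the key, the family WRAA01 and the genus WRAA01 remain two separate entries.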

Cheers, Donovan

donovan-parks commented 4 months ago

Note that the missing genus entry in nodes.dmp is potentially a problem. For example, I'm using the output of TaxonKit as input files for MetaCache and I would want it to be understood that GCA_009780445.1 belongs to the genus WRAA01.

shenwei356 commented 4 months ago

I see. So I have to change the whole logic.

> In the GTDB taxonomy these are distinct labels as designated by the different rank suffix.

Do you mean prefix? And this inspires me. Thank you! To distinguish duplicated names, like the family WRAA01 and the genus WRAA01, I think I can just hash names together with the rank prefix, e.g., f__WRAA01 and g__WRAA01, to get a unique TaxId. I'll do it tomorrow; it's late in the UK.

Best, Wei

donovan-parks commented 4 months ago

Thanks - much appreciated. And yes, I meant prefix (f__, g__).

shenwei356 commented 4 months ago

Now, duplicated names with different ranks are allowed. TaxIds are generated from the hash value of rank+taxon_name (in lower case).

$ grep GCA_009780445.1 gtdb-taxdump/R214/taxid.map \
    | taxonkit reformat -I 2 --data-dir gtdb-taxdump/R214/
GCA_009780445.1 1662163052      Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445

$ echo WRAA01 | taxonkit name2taxid --data-dir gtdb-taxdump/R214/ -r
WRAA01  718672132       genus
WRAA01  1562716195      family
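
The idea of hashing the rank together with the lower-cased name can be sketched in Python as follows. The thread does not say which hash function or exact key layout TaxonKit uses, so this sketch substitutes CRC32 from the standard library and a `prefix__name` key purely for illustration. The values it produces will not match the TaxIds shown above, and `sketch_taxid` is a hypothetical name.

```python
import zlib


def sketch_taxid(rank_prefix, name):
    """Derive an illustrative integer ID from rank prefix + lower-cased name.

    CRC32 is a stand-in here; it is NOT claimed to be TaxonKit's hash.
    """
    key = (rank_prefix + "__" + name).lower()
    return zlib.crc32(key.encode("utf-8"))


if __name__ == "__main__":
    family_id = sketch_taxid("f", "WRAA01")
    genus_id = sketch_taxid("g", "WRAA01")
    # The rank prefix makes the keys differ, so the two IDs differ as well.
    print(family_id != genus_id)
```

Including the rank in the hashed key is what lets a family and a genus with the same name receive distinct IDs, while lower-casing keeps the ID stable across differences in capitalisation.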

donovan-parks commented 4 months ago

Thanks Wei - much appreciated.

shenwei356 commented 4 months ago

Oh right, here's how to generate the GTDB-like format:

$ echo 599451526 \
    | taxonkit reformat -I 1 -P --prefix-k d__
599451526       d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
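
For lineages that are already available as plain names outside of TaxonKit, the same prefixed format can be produced with a few lines of Python. This is a minimal sketch that assumes exactly seven names ordered from domain to species; `to_gtdb_lineage` is a hypothetical helper, not a TaxonKit function.

```python
GTDB_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def to_gtdb_lineage(names):
    """Join seven names (domain to species) into a GTDB-style lineage string."""
    if len(names) != len(GTDB_PREFIXES):
        raise ValueError("expected 7 names, got %d" % len(names))
    return ";".join(prefix + name for prefix, name in zip(GTDB_PREFIXES, names))


if __name__ == "__main__":
    print(to_gtdb_lineage([
        "Bacteria", "Pseudomonadota", "Gammaproteobacteria", "Enterobacterales",
        "Enterobacteriaceae", "Escherichia", "Escherichia coli",
    ]))
```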

shenwei356 / taxonkit

Missing ranks for GTDB genomes with duplicate names at different ranks #92