donovan-parks closed 4 months ago
Yes. For these cases where the child and parent taxa share the same name, I assumed, based on some observations, that there was no classified genus (which is common in the NCBI taxonomy, especially in Viruses). E.g.,
$ taxonkit list --ids 1698208185 --data-dir . -nr
1698208185 [family] WRAA01
788434200 [species] WRAA01 sp009780445
1595698180 [no rank] 009780445
1373682363 [species] WRAA01 sp009780015
561963250 [no rank] 009780015
If there's a genus, there should be only one, WRAA01. If there is more than one genus, the genus name would not be the same as the parent (family).
Ah, I should have asked you before writing this command. I'm open to discussion now.
With the current implementation, we can also output the genus name, derived from the parent taxon:
$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/
GCA_009780445.1 1595698180 Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;;WRAA01 sp009780445
$ grep GCA_009780445.1 R214.1/taxid.map | taxonkit reformat -I 2 --data-dir R214.1/ -F -p "" -s ""
GCA_009780445.1 1595698180 Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445
Hi.
Thank you for the quick response.
The GTDB taxonomy is always "complete". That is, every genome is assigned to all 7 ranks. The genome GCA_009780445.1 is assigned to the genus g__WRAA01 in the family f__WRAA01. In the GTDB taxonomy these are distinct labels as designated by the different rank suffix. As such, I would expect the nodes.dmp file to contain a genus and a family entry, both with the name WRAA01.
Cheers, Donovan
Note that the missing genus entry in nodes.dmp is potentially a problem. For example, I'm using the output of TaxonKit as input files for MetaCache and I would want it to be understood that GCA_009780445.1 belongs to the genus WRAA01.
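One way to check for this kind of gap is to list which ranks each name carries in the generated taxdump. The following is only a rough sketch: it assumes the usual NCBI-style `nodes.dmp`/`names.dmp` layout (fields separated by a tab, a pipe, and a tab), and the sample lines and small TaxIds are made up for illustration rather than taken from a real GTDB taxdump.

```python
from collections import defaultdict

# Made-up sample lines in NCBI taxdump layout; TaxIds 1, 10, 20, 30 are placeholders.
NODES = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tfamily\t|\n"
    "20\t|\t10\t|\tgenus\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)
NAMES = (
    "10\t|\tWRAA01\t|\t\t|\tscientific name\t|\n"
    "20\t|\tWRAA01\t|\t\t|\tscientific name\t|\n"
    "30\t|\tWRAA01 sp009780445\t|\t\t|\tscientific name\t|\n"
)


def parse_dmp(text):
    """Split taxdump text into rows of fields."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.endswith("\t|"):
            line = line[:-2]  # drop the trailing field terminator
        rows.append(line.split("\t|\t"))
    return rows


def ranks_by_name(nodes_text, names_text):
    """Map each taxon name to the set of ranks it appears with."""
    rank_of = {row[0]: row[2] for row in parse_dmp(nodes_text)}
    ranks = defaultdict(set)
    for row in parse_dmp(names_text):
        taxid, name = row[0], row[1]
        if taxid in rank_of:
            ranks[name].add(rank_of[taxid])
    return ranks


ranks = ranks_by_name(NODES, NAMES)
print(sorted(ranks["WRAA01"]))
```

If the genus node were missing, as described above, the set for WRAA01 would contain only the family rank.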
I see. So I have to change the whole logic.
In the GTDB taxonomy these are distinct labels as designated by the different rank suffix.
Do you mean prefix?
And this inspires me. Thank you! To distinguish duplicated names, like the family WRAA01 and genus WRAA01, I think I can just hash names with the rank prefix, e.g., f__WRAA01 and g__WRAA01, to get a unique TaxId.
I'll do it tomorrow; it's late in the UK.
Best, Wei
Thanks - much appreciated. And yes, I meant prefix (f__, g__).
Now, duplicated names with different ranks are allowed.
TaxIds are generated from the hash value of rank+taxon_name (in lower case).
$ grep GCA_009780445.1 gtdb-taxdump/R214/taxid.map \
| taxonkit reformat -I 2 --data-dir gtdb-taxdump/R214/
GCA_009780445.1 1662163052 Bacteria;Bacillota_A;Clostridia;Lachnospirales;WRAA01;WRAA01;WRAA01 sp009780445
$ echo WRAA01 | taxonkit name2taxid --data-dir gtdb-taxdump/R214/ -r
WRAA01 718672132 genus
WRAA01 1562716195 family
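To illustrate why hashing the rank together with the name separates the two WRAA01 entries, here is a minimal sketch. It makes several assumptions: the thread does not say which hash function TaxonKit uses, so SHA-1 truncated to 31 bits is used purely as a stand-in, and joining rank and name with a literal `+` is a guess at what "rank+taxon_name" means. The function name `sketch_taxid` is hypothetical, and the numbers it produces will not match the TaxIds shown above.

```python
import hashlib


def sketch_taxid(rank: str, name: str) -> int:
    """Derive a non-negative 31-bit integer from the lower-cased rank and name.

    SHA-1 is a stand-in hash (an assumption), not TaxonKit's actual function.
    """
    key = f"{rank}+{name}".lower()
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    # Keep the first 4 bytes and clear the sign bit to stay within int32 range.
    return int.from_bytes(digest[:4], "big") & 0x7FFFFFFF


family_id = sketch_taxid("family", "WRAA01")
genus_id = sketch_taxid("genus", "WRAA01")
print("family:", family_id)
print("genus: ", genus_id)
```

Because the rank is part of the hashed key, the same name at two ranks maps to two different keys, and lower-casing makes the result independent of capitalization.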
Thanks Wei - much appreciated.
Oh right, here's the way to generate a GTDB-like format:
$ echo 599451526 \
| taxonkit reformat -I 1 -P --prefix-k d__
599451526 d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia coli
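For downstream scripts that consume this GTDB-like output, the prefixed lineage can be split back into (rank prefix, name) pairs. This is just a sketch under the assumption that every taxon carries one of the seven standard GTDB prefixes; `parse_lineage` and `format_lineage` are hypothetical helper names, not part of TaxonKit.

```python
RANK_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def parse_lineage(lineage):
    """Split a GTDB-style lineage into (prefix, name) pairs."""
    pairs = []
    for taxon in lineage.split(";"):
        prefix, name = taxon[:3], taxon[3:]
        if prefix not in RANK_PREFIXES:
            raise ValueError(f"unexpected rank prefix in {taxon!r}")
        pairs.append((prefix, name))
    return pairs


def format_lineage(pairs):
    """Join (prefix, name) pairs back into a GTDB-style lineage string."""
    return ";".join(prefix + name for prefix, name in pairs)


lineage = (
    "d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;"
    "f__WRAA01;g__WRAA01;s__WRAA01 sp009780445"
)
pairs = parse_lineage(lineage)
for prefix, name in pairs:
    print(prefix, name)
```

Keeping the prefix alongside the name is what keeps f__WRAA01 and g__WRAA01 distinct once the lineage is split up.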
I'm using v0.15.1. It appears the generated nodes.dmp is missing rank information for genomes with repeated GTDB taxon names at different ranks, e.g.:
GCA_009780445.1 d__Bacteria;p__Bacillota_A;c__Clostridia;o__Lachnospirales;f__WRAA01;g__WRAA01;s__WRAA01 sp009780445
For this genome, the nodes.dmp file contains a species entry and a family entry, but no genus entry. This doesn't seem correct, since this genome (and species) does have a named genus in the GTDB taxonomy.
Cheers, Donovan