Closed ggavelis closed 1 year ago
Hi Greg, thanks for using.
Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.
Yes, it did this intentionally. https://github.com/shenwei356/taxonkit/blob/d1981698050abb058572b8ee008ea52ef8a7803b/taxonkit/cmd/create-taxdump.go#L532
Some species, like SpSt-165 sp011057335
, do not have all taxon names at each rank, which are more common in viruses.
$ echo 1993618108 \
| taxonkit lineage --data-dir .
1993618108 Bacteria;CSP1-3;SpSt-165;SpSt-165 sp011057335
$ echo 1993618108 \
| taxonkit lineage --data-dir . \
| taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
| csvtk cut -Ht -f -2 \
| csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \
| csvtk pretty -t -S bold
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃
┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━╋━━━━━━━╋━━━━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ 1993618108 ┃ Bacteria ┃ CSP1-3 ┃ ┃ ┃ SpSt-165 ┃ ┃ SpSt-165 sp011057335 ┃
┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━┻━━━━━━━┻━━━━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛
While GTDB still shows names of each nodes, maybe for readability. While if we add them to taxdump files, that would add complexity.
$ zcat ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz \
| grep sp011057335 \
| head -n 1 \
| cut -f 2
d__Bacteria;p__CSP1-3;c__CSP1-3;o__CSP1-3;f__SpSt-165;g__SpSt-165;s__SpSt-165 sp011057335
The lineage above can also be generated by taxonkit reformat
(the flag -s
newly added in v0.14.3, please use the binaries below). taxonkit reformat
can fill missing taxon nodes with parent information.
$ echo 1993618108 \
| taxonkit lineage --data-dir . \
| taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d__ -p "" -s "" \
| cut -f 3
d__Bacteria;p__CSP1-3;c__CSP1-3;o__CSP1-3;f__SpSt-165;g__SpSt-165;s__SpSt-165 sp011057335
$ echo 1993618108 \
| taxonkit lineage --data-dir . \
| taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P --prefix-k d__ -p "" -s "" \
| csvtk cut -Ht -f -2 \
| csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \
| csvtk pretty -t -S bold
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃
┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ 1993618108 ┃ d__Bacteria ┃ p__CSP1-3 ┃ c__CSP1-3 ┃ o__CSP1-3 ┃ f__SpSt-165 ┃ g__SpSt-165 ┃ s__SpSt-165 sp011057335 ┃
┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━┛
Maybe you can use this method to create the kaiju_hit_lineages.tsv
.
$ taxonkit list --data-dir . --ids 1 -I "" \
| taxonkit lineage --data-dir . \
| taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d__ -p "" -s "" \
| cut -f 1,3 \
| head -n 3
1 d__root superkingdom;p__root;c__root;o__root;f__root;g__root;s__root
439684927 d__Archaea;p__Archaea;c__Archaea;o__Archaea;f__Archaea;g__Archaea;s__Archaea
3550141 d__Archaea;p__Nanohaloarchaeota;c__Nanohaloarchaeota;o__Nanohaloarchaeota;f__Nanohaloarchaeota;g__Nanohaloarchaeota;s__Nanohaloarchaeota
Excellent, thank you!
On Mon, Aug 7, 2023 at 4:48 AM Wei Shen @.***> wrote:
Hi Greg, thanks for using.
Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.
Yes, it did this intentionally. https://github.com/shenwei356/taxonkit/blob/d1981698050abb058572b8ee008ea52ef8a7803b/taxonkit/cmd/create-taxdump.go#L532
Some species, like SpSt-165 sp011057335, do not have all taxon names at each rank, which are more common in viruses.
$ echo 1993618108 \ | taxonkit lineage --data-dir . 1993618108 Bacteria;CSP1-3;SpSt-165;SpSt-165 sp011057335
$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \ | csvtk cut -Ht -f -2 \ | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \ | csvtk pretty -t -S bold ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃ ┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━╋━━━━━━━╋━━━━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 1993618108 ┃ Bacteria ┃ CSP1-3 ┃ ┃ ┃ SpSt-165 ┃ ┃ SpSt-165 sp011057335 ┃ ┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━┻━━━━━━━┻━━━━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛
While GTDB still shows names of each nodes, maybe for readability. While if we add them to taxdump files, that would add complexity.
$ zcat ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz \ | grep sp011057335 \ | head -n 1 \ | cut -f 2 dBacteria;pCSP1-3;cCSP1-3;oCSP1-3;fSpSt-165;gSpSt-165;s__SpSt-165 sp011057335
The lineage above can also be generated by taxonkit (currently no flag to remove the suffix). taxonkit reformat can fill missing taxon nodes with parent information.
$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d -p "" \ | cut -f 3 \ | sed 's/ phylum;/;/' | sed 's/ class;/;/' | sed 's/ order;/;/' | sed 's/ family;/;/' | sed 's/ genus;/;/' dBacteria;pCSP1-3;cCSP1-3;oCSP1-3;fSpSt-165;gSpSt-165;sSpSt-165 sp011057335
$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P --prefix-k d -p "" \ | sed 's/ phylum//' | sed 's/ class//' | sed 's/ order//' | sed 's/ family//' | sed 's/ genus//' \ | csvtk cut -Ht -f -2 \ | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \ | csvtk pretty -t -S bold ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃ ┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 1993618108 ┃ dBacteria ┃ pCSP1-3 ┃ cCSP1-3 ┃ oCSP1-3 ┃ fSpSt-165 ┃ gSpSt-165 ┃ sSpSt-165 sp011057335 ┃ ┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━┛
Maybe you can use this method to create the kaiju_hit_lineages.tsv.
$ taxonkit list --data-dir . --ids 1 -I "" \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d -p "" \ | sed 's/ phylum//' | sed 's/ class//' | sed 's/ order//' | sed 's/ family//' | sed 's/ genus//' | sed 's/ species//' \ | cut -f 1,3 \ | head -n 3 1 droot superkingdom;proot;croot;oroot;froot;groot;sroot 439684927 dArchaea;pArchaea;cArchaea;oArchaea;fArchaea;gArchaea;sArchaea 3550141 dArchaea;pNanohaloarchaeota;cNanohaloarchaeota;oNanohaloarchaeota;fNanohaloarchaeota;gNanohaloarchaeota;sNanohaloarchaeota
— Reply to this email directly, view it on GitHub https://github.com/shenwei356/taxonkit/issues/85#issuecomment-1667455270, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKLAKTUMDERA6DTWADDKKG3XUCTXLANCNFSM6AAAAAA3ELN244 . You are receiving this because you authored the thread.Message ID: @.***>
IN SHORT:
When names in a lineage are identical (e.g if genus, family, and order are all called 'SpSt-165') those ranks get collapsed in the names.dmp and nodes.dmp files. This results in incomplete lineages for many clades. Is there a way to put the skipped ranks back into the names and nodes.dmp file?
BACKGROUND:
Using your GTDB R207 names.dmp and nodes.dmp files downloaded from here: https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.3.1/gtdb-taxdump.tar.gz
I made a custom reference database for kaiju, and ran kaiju*, which worked fine. Then I ran kaiju's -addTaxonNames command**, where the output file (kaiju_hit_lineages.tsv) had an issue (commands at end of thread)
ISSUE:
Most assigned lineages are fine, e.g. 'dBacteria; pFirmicutes_B; cDesulfotomaculia; oAmmonifexales; NA; NA; NA; ' (In this example, NA's are expected because it could only be classified to the order)
But about 5% have skipped certain ranks, i.e. those internal ranks are left as 'NA' like... 'dBacteria;pCSP1-3; NA; NA; fSpSt-165; NA; sSpSt-165 sp011057335'
While this lineage should have been: 'dBacteria;pCSP1-3; cCSP1-3; oCSP1-3; fSpSt-165; gSpSt-165; s__SpSt-165 sp011057335'
ROOT CAUSE:
I tracked this problem back to the names.dmp and nodes.dmp files, and found that some lineages are indeed missing internal ranks.
For example, with the above lineage (species = "SpSt-165 sp011057335", taxid=1993618108 ) I started at that species taxid then tried to walk down parent_taxids until I got to the root. Indeed, I found that it skipped from species->family->phylum.
HOW WIDESPREAD IS THE PROBLEM?:
These gaps seem to be found in any lineage where some of the taxon names are the same. E.g. In this example lineage:
genus, family, and order = 'SpSt-165' class and phylum = 'CSP1-3'
Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.
Is there a way to put the skipped ranks back into the names and nodes.dmp file?
Apologies for my longwinded explanation here. It took me awhile to pinpoint this issue. That being said, I very much appreciate taxonkit and being able to use GTDB clades as if they had taxids.
Thanks,
~Greg
COMMANDS: _____
** -addTaxonNames command: kaiju-addTaxonNames -n names.dmp -i kaiju_hits -o kaiju_hit_lineages.tsv -r superkingdom,phylum,class,order,family,genus,species