shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License
375 stars 29 forks source link

ranks skipped in certain GTDB lineages #85

Closed ggavelis closed 1 year ago

ggavelis commented 1 year ago

IN SHORT:

When names in a lineage are identical (e.g if genus, family, and order are all called 'SpSt-165') those ranks get collapsed in the names.dmp and nodes.dmp files. This results in incomplete lineages for many clades. Is there a way to put the skipped ranks back into the names and nodes.dmp file?

BACKGROUND:

Using your GTDB R207 names.dmp and nodes.dmp files downloaded from here: https://github.com/shenwei356/gtdb-taxdump/releases/download/v0.3.1/gtdb-taxdump.tar.gz

I made a custom reference database for kaiju, and ran kaiju*, which worked fine. Then I ran kaiju's -addTaxonNames command**, where the output file (kaiju_hit_lineages.tsv) had an issue (commands at end of thread)

ISSUE:

Most assigned lineages are fine, e.g. 'dBacteria; pFirmicutes_B; cDesulfotomaculia; oAmmonifexales; NA; NA; NA; ' (In this example, NA's are expected because it could only be classified to the order)

But about 5% have skipped certain ranks, i.e. those internal ranks are left as 'NA' like... 'dBacteria;pCSP1-3; NA; NA; fSpSt-165; NA; sSpSt-165 sp011057335'

While this lineage should have been: 'dBacteria;pCSP1-3; cCSP1-3; oCSP1-3; fSpSt-165; gSpSt-165; s__SpSt-165 sp011057335'

ROOT CAUSE:

I tracked this problem back to the names.dmp and nodes.dmp files, and found that some lineages are indeed missing internal ranks.

For example, with the above lineage (species = "SpSt-165 sp011057335", taxid=1993618108 ) I started at that species taxid then tried to walk down parent_taxids until I got to the root. Indeed, I found that it skipped from species->family->phylum.

HOW WIDESPREAD IS THE PROBLEM?:

These gaps seem to be found in any lineage where some of the taxon names are the same. E.g. In this example lineage:
genus, family, and order = 'SpSt-165' class and phylum = 'CSP1-3'

Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.

Is there a way to put the skipped ranks back into the names and nodes.dmp file?

Apologies for my longwinded explanation here. It took me awhile to pinpoint this issue. That being said, I very much appreciate taxonkit and being able to use GTDB clades as if they had taxids.

Thanks,

~Greg

COMMANDS: _____

** -addTaxonNames command: kaiju-addTaxonNames -n names.dmp -i kaiju_hits -o kaiju_hit_lineages.tsv -r superkingdom,phylum,class,order,family,genus,species

shenwei356 commented 1 year ago

Hi Greg, thanks for using.

Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.

Yes, it did this intentionally. https://github.com/shenwei356/taxonkit/blob/d1981698050abb058572b8ee008ea52ef8a7803b/taxonkit/cmd/create-taxdump.go#L532

Some species, like SpSt-165 sp011057335, do not have all taxon names at each rank, which are more common in viruses.

$ echo 1993618108 \  
    | taxonkit lineage  --data-dir .
1993618108      Bacteria;CSP1-3;SpSt-165;SpSt-165 sp011057335

$ echo 1993618108 \
    | taxonkit lineage  --data-dir . \
    | taxonkit reformat --data-dir . -I 1  -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
    | csvtk cut -Ht -f -2 \
    | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \
    | csvtk pretty -t -S bold
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓
┃ taxid      ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family   ┃ genus ┃ species              ┃
┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━╋━━━━━━━╋━━━━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫
┃ 1993618108 ┃ Bacteria     ┃ CSP1-3 ┃       ┃       ┃ SpSt-165 ┃       ┃ SpSt-165 sp011057335 ┃
┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━┻━━━━━━━┻━━━━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

While GTDB still shows names of each nodes, maybe for readability. While if we add them to taxdump files, that would add complexity.

$ zcat ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz \
    | grep sp011057335 \
    | head -n 1 \
    | cut  -f 2
d__Bacteria;p__CSP1-3;c__CSP1-3;o__CSP1-3;f__SpSt-165;g__SpSt-165;s__SpSt-165 sp011057335

The lineage above can also be generated by taxonkit reformat (the flag -s newly added in v0.14.3, please use the binaries below). taxonkit reformat can fill missing taxon nodes with parent information.

$ echo 1993618108 \
    | taxonkit lineage  --data-dir . \
    | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d__ -p "" -s "" \
    | cut -f 3
d__Bacteria;p__CSP1-3;c__CSP1-3;o__CSP1-3;f__SpSt-165;g__SpSt-165;s__SpSt-165 sp011057335

$ echo 1993618108 \
    | taxonkit lineage  --data-dir . \
    | taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}"  -F -P --prefix-k d__ -p "" -s "" \
    | csvtk cut -Ht -f -2 \
    | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \
    | csvtk pretty -t -S bold
┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ taxid      ┃ superkingdom ┃ phylum    ┃ class     ┃ order     ┃ family      ┃ genus       ┃ species                 ┃
┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━┫
┃ 1993618108 ┃ d__Bacteria  ┃ p__CSP1-3 ┃ c__CSP1-3 ┃ o__CSP1-3 ┃ f__SpSt-165 ┃ g__SpSt-165 ┃ s__SpSt-165 sp011057335 ┃
┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━┛

Maybe you can use this method to create the kaiju_hit_lineages.tsv.

$ taxonkit list --data-dir . --ids 1 -I "" \
    | taxonkit lineage  --data-dir . \
    | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d__ -p "" -s "" \
    | cut -f 1,3 \
    | head -n 3
1       d__root superkingdom;p__root;c__root;o__root;f__root;g__root;s__root
439684927       d__Archaea;p__Archaea;c__Archaea;o__Archaea;f__Archaea;g__Archaea;s__Archaea
3550141 d__Archaea;p__Nanohaloarchaeota;c__Nanohaloarchaeota;o__Nanohaloarchaeota;f__Nanohaloarchaeota;g__Nanohaloarchaeota;s__Nanohaloarchaeota
ggavelis commented 1 year ago

Excellent, thank you!

On Mon, Aug 7, 2023 at 4:48 AM Wei Shen @.***> wrote:

Hi Greg, thanks for using.

Since the names are identical, taxonkit has evidently dropped all but one instance of each, which resulted in those ranks getting skipped/collapsed in the nodes.dmp and names.dmp file.

Yes, it did this intentionally. https://github.com/shenwei356/taxonkit/blob/d1981698050abb058572b8ee008ea52ef8a7803b/taxonkit/cmd/create-taxdump.go#L532

Some species, like SpSt-165 sp011057335, do not have all taxon names at each rank, which are more common in viruses.

$ echo 1993618108 \ | taxonkit lineage --data-dir . 1993618108 Bacteria;CSP1-3;SpSt-165;SpSt-165 sp011057335

$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \ | csvtk cut -Ht -f -2 \ | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \ | csvtk pretty -t -S bold ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┳━━━━━━━┳━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┓ ┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃ ┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━╋━━━━━━━╋━━━━━━━╋━━━━━━━━━━╋━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 1993618108 ┃ Bacteria ┃ CSP1-3 ┃ ┃ ┃ SpSt-165 ┃ ┃ SpSt-165 sp011057335 ┃ ┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━┻━━━━━━━┻━━━━━━━┻━━━━━━━━━━┻━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━┛

While GTDB still shows names of each nodes, maybe for readability. While if we add them to taxdump files, that would add complexity.

$ zcat ar53_taxonomy_r207.tsv.gz bac120_taxonomy_r207.tsv.gz \ | grep sp011057335 \ | head -n 1 \ | cut -f 2 dBacteria;pCSP1-3;cCSP1-3;oCSP1-3;fSpSt-165;gSpSt-165;s__SpSt-165 sp011057335

The lineage above can also be generated by taxonkit (currently no flag to remove the suffix). taxonkit reformat can fill missing taxon nodes with parent information.

$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d -p "" \ | cut -f 3 \ | sed 's/ phylum;/;/' | sed 's/ class;/;/' | sed 's/ order;/;/' | sed 's/ family;/;/' | sed 's/ genus;/;/' dBacteria;pCSP1-3;cCSP1-3;oCSP1-3;fSpSt-165;gSpSt-165;sSpSt-165 sp011057335

$ echo 1993618108 \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -f "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" -F -P --prefix-k d -p "" \ | sed 's/ phylum//' | sed 's/ class//' | sed 's/ order//' | sed 's/ family//' | sed 's/ genus//' \ | csvtk cut -Ht -f -2 \ | csvtk add-header -t -n "taxid,superkingdom,phylum,class,order,family,genus,species" \ | csvtk pretty -t -S bold ┏━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━┓ ┃ taxid ┃ superkingdom ┃ phylum ┃ class ┃ order ┃ family ┃ genus ┃ species ┃ ┣━━━━━━━━━━━━╋━━━━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━┫ ┃ 1993618108 ┃ dBacteria ┃ pCSP1-3 ┃ cCSP1-3 ┃ oCSP1-3 ┃ fSpSt-165 ┃ gSpSt-165 ┃ sSpSt-165 sp011057335 ┃ ┗━━━━━━━━━━━━┻━━━━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━┛

Maybe you can use this method to create the kaiju_hit_lineages.tsv.

$ taxonkit list --data-dir . --ids 1 -I "" \ | taxonkit lineage --data-dir . \ | taxonkit reformat --data-dir . -I 1 -F -P --prefix-k d -p "" \ | sed 's/ phylum//' | sed 's/ class//' | sed 's/ order//' | sed 's/ family//' | sed 's/ genus//' | sed 's/ species//' \ | cut -f 1,3 \ | head -n 3 1 droot superkingdom;proot;croot;oroot;froot;groot;sroot 439684927 dArchaea;pArchaea;cArchaea;oArchaea;fArchaea;gArchaea;sArchaea 3550141 dArchaea;pNanohaloarchaeota;cNanohaloarchaeota;oNanohaloarchaeota;fNanohaloarchaeota;gNanohaloarchaeota;sNanohaloarchaeota

— Reply to this email directly, view it on GitHub https://github.com/shenwei356/taxonkit/issues/85#issuecomment-1667455270, or unsubscribe https://github.com/notifications/unsubscribe-auth/AKLAKTUMDERA6DTWADDKKG3XUCTXLANCNFSM6AAAAAA3ELN244 . You are receiving this because you authored the thread.Message ID: @.***>