shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

no lineage info when using custom nodes+names dmp #26

Closed (nick-youngblut closed 4 years ago)

nick-youngblut commented 4 years ago

I created a script to convert the Genome Taxonomy Database (GTDB) taxonomy to nodes.dmp + names.dmp files. The output looks like:

names.dmp

1   |   all |       |   synonym |
1   |   root    |       |   scientific_name |
2   |   d__Archaea  |       |   scientific_name |
3   |   p__Halobacterota    |       |   scientific_name |
4   |   c__Methanosarcinia  |       |   scientific_name |
5   |   o__Methanosarcinales    |       |   scientific_name |
6   |   f__Methanosarcinaceae   |       |   scientific_name |
7   |   g__Methanosarcina   |       |   scientific_name |
8   |   s__Methanosarcina mazei |       |   scientific_name |
9   |   RS_GCF_000979745.1  |       |   scientific_name |
10  |   RS_GCF_000980175.1  |       |   scientific_name |
11  |   RS_GCF_000980005.1  |       |   scientific_name |
12  |   RS_GCF_000979595.1  |       |   scientific_name |
13  |   RS_GCF_000979555.1  |       |   scientific_name |
14  |   RS_GCF_000979915.1  |       |   scientific_name |
15  |   RS_GCF_000970165.1  |       |   scientific_name |
16  |   RS_GCF_000979125.1  |       |   scientific_name |
17  |   RS_GCF_000979015.1  |       |   scientific_name |
18  |   RS_GCF_000979925.1  |       |   scientific_name |
19  |   RS_GCF_000980105.1  |       |   scientific_name |

nodes.dmp

1   |   1   |   no rank |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
2   |   1   |   superkingdom    |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
3   |   2   |   phylum  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
4   |   3   |   class   |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
5   |   4   |   order   |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
6   |   5   |   family  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
7   |   6   |   genus   |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
8   |   7   |   species |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
9   |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
10  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
11  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
12  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
13  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
14  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
15  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
16  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
17  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
18  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
19  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
20  |   8   |   subspecies  |   XX  |   0   |   0   |   11  |   1   |   1   |   0   |   0   |   0   |
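
For reference, the rows above can be read back and the parent pointers followed by hand. The following is only a minimal sketch, not taxonkit's own code: the helper names (`parse_dmp_line`, `load_parents`, `taxid_path`) are made up for illustration, and it assumes every row uses `|` as the field separator and that the root is the node listed as its own parent.

```python
def parse_dmp_line(line):
    """Split one .dmp row on '|' and strip the padding around each field."""
    fields = [f.strip() for f in line.rstrip("\n").split("|")]
    # Rows end with a trailing '|', which leaves one empty field at the end.
    if fields and fields[-1] == "":
        fields.pop()
    return fields


def load_parents(lines):
    """Map each taxid to (parent taxid, rank) from nodes.dmp rows."""
    parents = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = parse_dmp_line(line)
        parents[int(fields[0])] = (int(fields[1]), fields[2])
    return parents


def taxid_path(taxid, parents):
    """Walk parent pointers up to the root (the node that is its own parent)."""
    path = [taxid]
    while parents[taxid][0] != taxid:
        taxid = parents[taxid][0]
        path.append(taxid)
    return path[::-1]  # root first


# A few rows copied from the nodes.dmp sample above (trailing columns dropped).
sample_nodes = [
    "1   |   1   |   no rank |   XX  |",
    "2   |   1   |   superkingdom    |   XX  |",
    "3   |   2   |   phylum  |   XX  |",
    "4   |   3   |   class   |   XX  |",
    "5   |   4   |   order   |   XX  |",
    "6   |   5   |   family  |   XX  |",
    "7   |   6   |   genus   |   XX  |",
    "8   |   7   |   species |   XX  |",
    "10  |   8   |   subspecies  |   XX  |",
]

parents = load_parents(sample_nodes)
print(taxid_path(10, parents))
```

If the path from a taxid to the root comes out complete here, the tree structure in nodes.dmp is intact and any missing lineage text has to come from names.dmp.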

taxonkit list works as expected, but taxonkit lineage does not provide any lineage info. For example:

1       no rank
2       superkingdom
10  ;;;;;;; subspecies
397 ;;; order
982 ;;;;;   genus
541 ;;;;    family
3844    ;;;;;;  species

Any idea why I'm not getting the full lineage info? I tried to look at the taxonkit code to see if it was filtering based on the EMBL code or something else, but I can't see what the problem is (it doesn't help that I don't know Go).

shenwei356 commented 4 years ago

In names.dmp, the name class must be "scientific name" (with a space), not "scientific_name":

$ more names.dmp 
1       |       all     |               |       synonym |
1       |       root    |               |       scientific name |
2       |       Bacteria        |       Bacteria <bacteria>     |       scientific name |
2       |       Monera  |       Monera <bacteria>       |       in-part |
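
An existing names.dmp can be patched without regenerating it. This is a rough sketch under a few assumptions: the name class is the fourth `|`-separated column (as in the rows above), and the function and file names are hypothetical, not part of taxonkit.

```python
def fix_name_class(line, wrong="scientific_name", right="scientific name"):
    """Rewrite only the name-class field (4th column) of a names.dmp row."""
    fields = line.rstrip("\n").split("|")
    # fields: taxid, name, unique name, name class, then the tail after the last '|'
    if len(fields) >= 4 and fields[3].strip() == wrong:
        # Replace inside the field so the original padding is preserved.
        fields[3] = fields[3].replace(wrong, right)
    return "|".join(fields)


def fix_names_file(src, dst):
    """Copy src to dst, correcting the name class on every row."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(fix_name_class(line) + "\n")


print(fix_name_class("2   |   d__Archaea  |       |   scientific_name |"))
print(fix_name_class("1   |   all |       |   synonym |"))
```

Only the fourth column is touched, so a taxon whose name happens to contain the string "scientific_name" is left alone. Calling `fix_names_file("names.dmp", "names.fixed.dmp")` (both paths are placeholders) writes a corrected copy and leaves the original file untouched.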

nick-youngblut commented 4 years ago

That did it. Thanks!