shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Extracting taxids for each desired taxonomic rank into their own columns #96

Closed charlesfoster closed 5 months ago

charlesfoster commented 5 months ago

Thanks for the great tool. I'm currently using it like so, based on the wiki:

$ taxonkit lineage taxids.txt | taxonkit reformat -I 1 -t -r NA -R 0 | csvtk -H -t cut -f 1,3 | csvtk -H -t sep -f 2 -s ';' -R | csvtk add-header -t -n taxid,kingdom,phylum,class,order,family,genus,species | csvtk pretty -t | head

Output:

16:22:20.217 [WARN] taxid 0 not found
16:22:20.576 [WARN] taxid 0 not found
taxid     kingdom    phylum               class                order             family               genus                 species
-------   -------   ------------------   ------------------   ---------------   ------------------   -------------------   -------------------------------
0         NA        NA                   NA                   NA                NA                   NA                    NA
1000824   Viruses   Negarnaviricota      Insthoviricetes      Articulavirales   Orthomyxoviridae     Alphainfluenzavirus   Alphainfluenzavirus influenzae
101753    Viruses   Negarnaviricota      Insthoviricetes      Articulavirales   Orthomyxoviridae     Alphainfluenzavirus   Alphainfluenzavirus influenzae
10239     Viruses   NA                   NA                   NA                NA                   NA                    NA

Is there an easy option I am missing to also add in the corresponding taxids of each of the desired ranks into their own columns?

When I run the following, for example, I get the taxids for each rank:

$ echo "1000824" | taxonkit lineage | taxonkit reformat -I 1 -t -r NA -R 0
1000824 Viruses;Riboviria;Orthornavirae;Negarnaviricota;Polyploviricotina;Insthoviricetes;Articulavirales;Orthomyxoviridae;Alphainfluenzavirus;Alphainfluenzavirus influenzae;Influenza A virus;H3N2 subtype;Influenza A virus (A/Uganda/MUWRP-015/2008(H3N2))  Viruses;Negarnaviricota;Insthoviricetes;Articulavirales;Orthomyxoviridae;Alphainfluenzavirus;Alphainfluenzavirus influenzae 10239;2497569;2497577;2499411;11308;197911;2955291

However, I'm just not sure how to take this to the next step, and also combine it with my initial commands. Ideally, I would be able to get the headers "taxid,kingdom,phylum,class,order,family,genus,species,kingdom_taxid,phylum_taxid,class_taxid,order_taxid,family_taxid,genus_taxid,species_taxid".

If need be, I will write a longer script (e.g. in Python) to do this step, but I'm just checking to make sure I'm not missing something in your great tools.

Thanks.

shenwei356 commented 5 months ago

taxonkit reformat can output taxids.

$ taxonkit reformat -h
-t, --show-lineage-taxids            show corresponding taxids of reformated lineage

$ echo 562 \
    | taxonkit reformat -I 1 -t -r NA -R 0
562     Bacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    2;1224;1236;91347;543;561;562

So you just need to split the two semicolon-delimited columns (names and taxids) into multiple columns.

$ echo 562 \
    | taxonkit reformat -I 1 -t -r NA -R 0 \
    | csvtk -H -t sep -f 2 -s ';' -R  \
    | csvtk -H -t sep -f 2 -s ';' -R \
    | csvtk add-header -t -n "taxid,kingdom,phylum,class,order,family,genus,species,kingdom_taxid,phylum_taxid,class_taxid,order_taxid,family_taxid,genus_taxid,species_taxid" \
    | csvtk pretty -t
taxid   kingdom    phylum           class                 order              family               genus         species            kingdom_taxid   phylum_taxid   class_taxid   order_taxid   family_taxid   genus_taxid   species_taxid
-----   --------   --------------   -------------------   ----------------   ------------------   -----------   ----------------   -------------   ------------   -----------   -----------   ------------   -----------   -------------
562     Bacteria   Pseudomonadota   Gammaproteobacteria   Enterobacterales   Enterobacteriaceae   Escherichia   Escherichia coli   2               1224           1236          91347         543            561           562 
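
For readers without csvtk at hand, the same column split can be sketched with plain awk. This is only an illustration under stated assumptions: `split_ranks` is a hypothetical helper name (not part of taxonkit or csvtk), and it assumes the three-column, tab-separated output of `taxonkit reformat -I 1 -t` shown above (taxid, `;`-joined names, `;`-joined taxids) with the default seven ranks.

```shell
# Hypothetical helper, not part of taxonkit or csvtk.
# Reads tab-separated lines from stdin: taxid, ';'-joined rank names, ';'-joined rank taxids.
# Prints a header plus one row per input line, with every rank in its own column.
split_ranks() {
    awk -F'\t' -v OFS='\t' '
        BEGIN {
            # Assumed rank order: the seven ranks used in the header of this thread.
            n = split("kingdom phylum class order family genus species", rank, " ")
            header = "taxid"
            for (i = 1; i <= n; i++) header = header OFS rank[i]
            for (i = 1; i <= n; i++) header = header OFS rank[i] "_taxid"
            print header
        }
        {
            split($2, name, ";")   # rank names
            split($3, id, ";")     # matching rank taxids
            row = $1
            for (i = 1; i <= n; i++) row = row OFS name[i]
            for (i = 1; i <= n; i++) row = row OFS id[i]
            print row
        }
    '
}

# Sample row copied from the reformat output above (taxid 562).
printf '562\tBacteria;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli\t2;1224;1236;91347;543;561;562\n' \
    | split_ranks
```

With taxonkit installed, the input would presumably come from something like `taxonkit reformat -I 1 -t -r NA -R 0 < taxids.txt | split_ranks`. If `taxonkit lineage` is run first, the output gains an extra lineage column, so the field numbers `$2` and `$3` in the sketch would need to shift accordingly.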

charlesfoster commented 5 months ago

Perfect! Thank you for the swift and helpful response.