shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License
357 stars 29 forks

Placeholder for Clade rank (reformat) #77

Closed poursalavati closed 1 year ago

poursalavati commented 1 year ago

Hi Wei,

I was wondering whether there is a placeholder for the clade rank.

I'm using the command below and would like 8 columns in the output (the clade rank is currently missing):

./taxonkit reformat -I 3 --data-dir ../taxdump/ joined -F -f "{K}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" > joined_lin_new

Here is part of the input file (joined):

GCA_000320725.1 GCA_000320725.1_99      1077221 1077221
GCA_000529295.1 GCA_000529295.1_1       10454   10454
GCA_000529295.1 GCA_000529295.1_10      10454   10454

shenwei356 commented 1 year ago

A clade can appear anywhere in the taxonomic tree, so it is not a specific rank.

poursalavati commented 1 year ago

Thanks. In the NCBI lineage it's the 2nd rank, for example in this virus lineage:

GCA_000320725.1       Varidnaviria    Bamfordvirae    Nucleocytoviricota      Megaviricetes   Imitervirales   Mimiviridae     unclassified Mimiviridae genus  Acanthamoeba polyphaga lentillevirus

Varidnaviria could not be exported when using: ./taxonkit reformat -I 3 --data-dir ../taxdump/ joined -F -f "{K}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}"

current output:

GCA_000320725.1      Bamfordvirae    Nucleocytoviricota      Megaviricetes   Imitervirales   Mimiviridae     unclassified Mimiviridae genus  Acanthamoeba polyphaga lentillevirus

I'm looking for a placeholder that can extract Varidnaviria (the clade rank).

shenwei356 commented 1 year ago

I see. It seems that there is always a clade between superkingdom and kingdom. However, there are also thousands of taxa that have two clades, which makes the clade ambiguous.

$ echo "2506204" \
    | taxonkit lineage -t \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
10239     superkingdom   Viruses
2731342   clade          Monodnaviria
2732092   kingdom        Shotokuvirae
2732415   phylum         Cossaviricota
2732421   class          Papovaviricetes
2732533   order          Zurhausenvirales
151340    family         Papillomaviridae
333774    no rank        unclassified Papillomaviridae
333933    clade          primate papillomaviruses
2506204   species        Macaca fuscata papillomavirus 2

$ taxonkit list --ids 1 \
    | taxonkit filter -L species -E species \
    | taxonkit lineage -R \
    | grep clade \
    | pigz -c \
    > clades.gz

$ zcat clades.gz \
    | grep Viruses \
    | grep -E "clade.*clade" \
    | wc -l
17888

t.txt
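
To make the ambiguity concrete, here is a minimal Python sketch (not part of taxonkit) that works on the ranked lineage listed above for taxid 2506204. The function names `clades` and `realm_like_clade` are hypothetical. The second one assumes the clade of interest is the node sitting directly between superkingdom and kingdom, so it selects by position rather than by rank name alone.

```python
# Rows copied from the `taxonkit lineage -r -n -L` output shown in this thread:
# (taxid, rank, name), ordered from root to leaf.
LINEAGE = [
    (10239, "superkingdom", "Viruses"),
    (2731342, "clade", "Monodnaviria"),
    (2732092, "kingdom", "Shotokuvirae"),
    (2732415, "phylum", "Cossaviricota"),
    (2732421, "class", "Papovaviricetes"),
    (2732533, "order", "Zurhausenvirales"),
    (151340, "family", "Papillomaviridae"),
    (333774, "no rank", "unclassified Papillomaviridae"),
    (333933, "clade", "primate papillomaviruses"),
    (2506204, "species", "Macaca fuscata papillomavirus 2"),
]


def clades(lineage):
    """Return the names of every node whose rank is 'clade'."""
    return [name for _, rank, name in lineage if rank == "clade"]


def realm_like_clade(lineage):
    """Return the clade directly between superkingdom and kingdom, or None.

    Hypothetical helper: it relies on position, not on the rank name alone.
    """
    triples = zip(lineage, lineage[1:], lineage[2:])
    for (_, r0, _), (_, r1, name), (_, r2, _) in triples:
        if (r0, r1, r2) == ("superkingdom", "clade", "kingdom"):
            return name
    return None


print(clades(LINEAGE))            # ['Monodnaviria', 'primate papillomaviruses']
print(realm_like_clade(LINEAGE))  # Monodnaviria
```

A single `clade` placeholder would have to choose between the two names in the first list, which is why the rank alone is not enough.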

poursalavati commented 1 year ago

Thanks Wei. Yes, you're right. In this case, it seems NCBI should change how it assigns this rank, since this example needs a "Realm" rather than a clade (based on ICTV). Anyway, I wrote a script that fixes it, though in an ugly way! It adds a new column based on the kingdom and writes the appropriate Realm (clade) name. Best, NP
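
The script itself was not posted, so the following is only a rough Python sketch of the kind of workaround described: look up the realm from the kingdom and insert it as an extra column. The names `KINGDOM_TO_REALM` and `add_realm` are hypothetical, and the table holds only the two kingdom/realm pairs that appear in lineages quoted in this thread. A real table would need an entry for every viral kingdom.

```python
# Illustrative kingdom -> realm pairs, taken only from lineages quoted in this thread.
# Assumption: a complete table would have to be built separately (e.g. from ICTV).
KINGDOM_TO_REALM = {
    "Bamfordvirae": "Varidnaviria",
    "Shotokuvirae": "Monodnaviria",
}


def add_realm(line, kingdom_col, missing=""):
    """Insert a realm column just before the kingdom column of a tab-separated line.

    kingdom_col is the 0-based index of the kingdom field; kingdoms that are
    not in the table get the `missing` value instead of a realm name.
    """
    fields = line.rstrip("\n").split("\t")
    realm = KINGDOM_TO_REALM.get(fields[kingdom_col], missing)
    return "\t".join(fields[:kingdom_col] + [realm] + fields[kingdom_col:])


row = "GCA_000320725.1\tBamfordvirae\tNucleocytoviricota\tMegaviricetes"
print(add_realm(row, kingdom_col=1))
```

Because the lookup is keyed on the kingdom name, lineages with an empty or unknown kingdom are left with an empty realm field rather than a guessed one.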