shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Is it expected that reformat produces different output depending on whether the input is only TaxIds or a lineage table? #62

Closed Midnighter closed 2 years ago

Midnighter commented 2 years ago

Describe your issue

I want to create a simplified taxonomy of all fungi. It seems that with the latest version of taxonkit, I have two options:

  1. Create a lineage file from a bunch of taxonomy IDs and then reformat it.

    taxonkit list --ids 4751 --indent "" | \
        taxonkit lineage | \
        taxonkit reformat \
        --taxid-field 1 \
        --fill-miss-rank \
        --output-ambiguous-result \
        --add-prefix \
        --show-lineage-taxids \
        --format "{k};{p};{c};{o};{f};{g};{s}"

    First line of the output:

    4751    cellular organisms;Eukaryota;Opisthokonta;Fungi k__Eukaryota;p__unclassified Eukaryota phylum;c__unclassified Eukaryota class;o__unclassified Eukaryota order;f__unclassified Eukaryota family;g__unclassified Eukaryota genus;s__unclassified Eukaryota species    2759;;;;;;

  2. Alternatively, I can directly use the identifiers as input to reformat.

    taxonkit list --ids 4751 --indent "" | \
        taxonkit reformat \
        --taxid-field 1 \
        --fill-miss-rank \
        --output-ambiguous-result \
        --add-prefix \
        --show-lineage-taxids \
        --format "{k};{p};{c};{o};{f};{g};{s}"

    First line of the output:

    4751    k__Eukaryota;p__unclassified Eukaryota phylum;c__unclassified Eukaryota class;o__unclassified Eukaryota order;f__unclassified Eukaryota family;g__unclassified Eukaryota genus;s__unclassified Eukaryota species    2759;;;;;;

shenwei356 commented 2 years ago

Yes, you found it. In the beginning, reformat only accepted a full lineage as input. In detail, it used a node and its parent node to determine the TaxId; however, this produced errors for some taxa. So, after v0.8.0, it also accepts TaxIds as input via the flag -I/--taxid-field.

Midnighter commented 2 years ago

Sorry, my question is the one in the title: is it expected that the output differs between the two? Probably yes, since the additional columns from the lineage file are preserved.
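
One way to check that the preserved lineage column is the only difference is to drop it and compare what remains. The sketch below does this for the two first output lines quoted above. It assumes the columns are tab-separated (the tabs are not visible in the quoted output), and the variable names are purely illustrative.

```shell
# First output lines from the two commands, as quoted in the issue.
# Assumption: the columns are tab-separated.
reformatted='k__Eukaryota;p__unclassified Eukaryota phylum;c__unclassified Eukaryota class;o__unclassified Eukaryota order;f__unclassified Eukaryota family;g__unclassified Eukaryota genus;s__unclassified Eukaryota species'
with_lineage=$(printf '4751\t%s\t%s\t%s\n' 'cellular organisms;Eukaryota;Opisthokonta;Fungi' "$reformatted" '2759;;;;;;')
ids_only=$(printf '4751\t%s\t%s\n' "$reformatted" '2759;;;;;;')

# Drop column 2 (the lineage added by `taxonkit lineage`), keep the rest.
stripped=$(printf '%s\n' "$with_lineage" | cut -f 1,3-)

# Compare the remaining columns with the TaxId-only output.
if [ "$stripped" = "$ids_only" ]; then
    result="identical apart from the lineage column"
else
    result="outputs differ"
fi
echo "$result"
```

With the complete outputs saved to files (hypothetical names), the same check could be run as `cut -f 1,3- with_lineage.tsv | diff - ids_only.tsv`, which prints nothing when the remaining columns match.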