shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Trouble creating taxdump after using reformat #63

Closed by Midnighter 2 years ago

Midnighter commented 2 years ago

Describe your issue

I have a few issues with using taxonkit create-taxdump after reformat.

  1. When I provide two-column input consisting of tax ID and lineage, names.dmp contains the full lineage as the scientific name, for example:

    62762   |   Eukaryota;Ascomycota;Sordariomycetes;Hypocreales;Hypocreaceae;Trichoderma;Trichoderma sp. 3 WYZ-2016    |       |   scientific name |

    and nodes.dmp contains the lineage from the first row in the rank field, for example:

    62762   |   638174346   |   Eukaryota;;;;;; |   XX  |   0   |   1   |   11  |   1   |   0   |   1   |   1   |   0   |       |
  2. When I provide --rank-names "superkingdom,phylum,class,order,family,genus,species", I get an error on the first line:

    the number (2, expect 7) of columns at line 1 does not match that of rank names (7)

    This happens whether or not I pass the --trim option to reformat.

  3. When I provide a reformatted lineage file that also contains the scientific name, rank, and tax ID lineage as columns, taxonkit create-taxdump takes much longer to produce output (I have yet to see whether it completes).
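
The error message in item 2 reports 2 columns where 7 were expected, which is consistent with the two-column input described in item 1. A quick way to check the column count of an input file before handing it to create-taxdump is sketched below. It uses only awk, and the helper name `check_columns` is hypothetical, not part of taxonkit.

```shell
# Hypothetical helper (not a taxonkit command): report every line on stdin
# whose number of tab-separated columns differs from the expected count.
# Usage: check_columns EXPECTED < input.tsv
check_columns() {
    awk -F '\t' -v expected="$1" \
        'NF != expected { printf "line %d: %d columns\n", NR, NF }'
}

# A tax ID plus a semicolon-joined lineage is only two tab-separated columns,
# so it fails a check against seven rank names.
printf '62762\tEukaryota;Ascomycota\n' | check_columns 7
# prints: line 1: 2 columns
```

No output means every line has the expected number of columns.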

Midnighter commented 2 years ago

The code I run beforehand is the one shown in https://github.com/shenwei356/taxonkit/issues/62#issue-1333192129, whose output I then process further using:

taxonkit create-taxdump --out-dir fungi --force

shenwei356 commented 2 years ago

The input file should be tab-delimited, with one taxon node per column, not the complete lineage in a single column.

You can use tabs in the format string, e.g. "{k}\t{..." in "taxonkit reformat", to create such an input file.

Also, | is not needed as a separator; the columns must be separated by tabs.
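
To illustrate the expected shape, here is a minimal sketch that turns a semicolon-delimited lineage into one tab-separated column per rank. It relies only on `tr` rather than on taxonkit itself, the helper name `lineage_to_columns` is hypothetical, and it assumes that no taxon name contains a semicolon.

```shell
# Hypothetical helper (not a taxonkit command): convert "a;b;c" lineages
# read from stdin into tab-separated columns, one taxon node per column.
# Assumes taxon names never contain ';'.
lineage_to_columns() {
    tr ';' '\t'
}

# Example lineage taken from the names.dmp line quoted earlier in this thread.
printf '%s\n' 'Eukaryota;Ascomycota;Sordariomycetes;Hypocreales;Hypocreaceae;Trichoderma;Trichoderma sp. 3 WYZ-2016' \
    | lineage_to_columns
```

Producing the tabs directly in the reformat format string, as suggested above, avoids this extra step.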

Midnighter commented 2 years ago

Thank you, that did the trick. I was looking at this example in the usage docs

  1. Names should be distinct in taxa of different ranks.
     But for these missing some taxon nodes, using names of parent nodes is allowed:

       GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     the Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign a new TaxId
     by increasing the TaxId until it being distinct.

       GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

and assumed this was the desired format. I guess that's only for --gtdb, though.

For posterity, my final command to extract all fungi as a simplified taxonomy:

taxonkit list --ids 4751 --indent "" \
    | taxonkit reformat --taxid-field 1 --output-ambiguous-result --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
    | cut --fields=2-8 \
    | taxonkit create-taxdump --out-dir fungi --force --rank-names "superkingdom,phylum,class,order,family,genus,species"

Midnighter commented 2 years ago

Actually, one more question, is it at all possible to preserve the taxon IDs rather than creating new ones?

shenwei356 commented 2 years ago

> is it at all possible to preserve the taxon IDs rather than creating new ones?

Not applicable.

I see, looks like you want to create a simplified NCBI taxdump dataset for Fungi.

Maybe you just need to use the original one and format the lineages to 7-rank-format after your analysis.

Midnighter commented 2 years ago

> I see, looks like you want to create a simplified NCBI taxdump dataset for Fungi.

Indeed, that's exactly my goal.

> Maybe you just need to use the original one and format the lineages to 7-rank-format after your analysis.

My issue is that I want to end up with files in the names.dmp and nodes.dmp format after simplifying to seven ranks. So I guess what I'll do is: use reformat to output the taxon IDs in tab-delimited format, create a taxdump from that, use the new names.dmp to map the newly created IDs back to the old ones, and then use the original NCBI names.dmp for the scientific names.
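
The mapping step in that plan could be sketched roughly as follows. This assumes the generated names.dmp follows the NCBI layout (fields separated by tab, pipe, tab) and that the old tax IDs were used as the node names, so the name column of each new node holds the old ID. The helper name `new_to_old_ids` is hypothetical and the IDs in the example are made up.

```shell
# Hypothetical helper (not a taxonkit command): print "new_id<TAB>old_id"
# for every scientific-name row of an NCBI-style names.dmp.
# Assumption: the name column holds the old tax ID, because the old IDs
# were used as node names when the taxdump was created.
# Usage: new_to_old_ids names.dmp   (or pipe names.dmp content on stdin)
new_to_old_ids() {
    awk -F '\t[|]\t' '$4 ~ /^scientific name/ { print $1 "\t" $2 }' "$@"
}

# Made-up names.dmp row: new ID 101 whose "name" is the old tax ID 4751.
printf '101\t|\t4751\t|\t\t|\tscientific name\t|\n' | new_to_old_ids
# prints: 101<TAB>4751
```

The resulting two-column table can then be joined against the original NCBI names.dmp to recover the scientific names.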