Closed. Midnighter closed this issue 2 years ago.
The code I used before is the one shown in https://github.com/shenwei356/taxonkit/issues/62#issue-1333192129, whose output I then process further using
taxonkit create-taxdump --out-dir fungi --force
The input file should be tab-delimited, with each column as a taxon node, not the complete lineage.
You can use tabs in the format string, as in "{k}\t{..." in "taxonkit reformat", to create such an input file. The "|" is not needed; we need tabs.
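As a quick sanity check before running create-taxdump, something like the following could confirm that every line has one tab-separated column per rank. This is only a sketch: the function name check_columns is a hypothetical helper, not part of taxonkit, and the expected column count is whatever number of ranks you pass in.

```shell
# Hypothetical helper (not part of taxonkit): report lines whose number of
# tab-separated columns differs from the expected number of ranks.
# Usage: check_columns EXPECTED_COLUMNS < lineages.tsv
check_columns() {
  awk -F '\t' -v n="$1" '
    # Print the line number and column count of every mismatching line.
    NF != n { printf "line %d: %d column(s)\n", NR, NF; bad = 1 }
    # Exit non-zero if any line did not match.
    END     { exit bad + 0 }
  '
}
```

For a seven-rank table this would be called as "check_columns 7 < lineages.tsv", where lineages.tsv is a placeholder file name. A line that still holds a complete semicolon- or pipe-separated lineage shows up as having a single column.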
Thank you, that did the trick. I was looking at this example in the usage docs:
1. Names should be distinct in taxa of different ranks.
But for those missing some taxon nodes, using names of parent nodes is allowed:
GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155
It can also detect duplicate names with different ranks, e.g.,
the Class and Genus have the same name B47-G6, and the Order and Family
between them have different names. In this case, we reassign a new TaxId
by increasing the TaxId until it is distinct.
GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
and assumed this was the desired format. I guess that's only for --gtdb, though.
My final command for posterity to extract all fungi as a simplified taxonomy:
taxonkit list --ids 4751 --indent "" \
| taxonkit reformat --taxid-field 1 --output-ambiguous-result --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
| cut --fields=2-8 \
| taxonkit create-taxdump --out-dir fungi --force --rank-names "superkingdom,phylum,class,order,family,genus,species"
Actually, one more question: is it at all possible to preserve the taxon IDs rather than creating new ones?
is it at all possible to preserve the taxon IDs rather than creating new ones?
Not applicable.
I see, looks like you want to create a simplified NCBI taxdump dataset for Fungi.
Maybe you just need to use the original one and format the lineages to 7-rank-format after your analysis.
I see, looks like you want to create a simplified NCBI taxdump dataset for Fungi.
Indeed, that's exactly my goal.
Maybe you just need to use the original one and format the lineages to 7-rank-format after your analysis.
My issue is that I wanted to end up with files in the format of names.dmp and nodes.dmp after simplifying to seven ranks. So I guess what I'll do is use reformat to output the taxon IDs in tab-delimited format, create a taxdump from that, use the resulting names.dmp to map the newly created IDs back to the old ones, and then use the original NCBI names.dmp for the scientific names.
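The last two steps of that plan could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested workflow: it assumes the names.dmp written by create-taxdump holds the original NCBI TaxId in its name column (because TaxIds were used as node names in the input), that both files use the usual tab-pipe-tab delimiter, and map_names is a hypothetical helper name.

```shell
# Hypothetical helper (not part of taxonkit).
# $1: names.dmp from create-taxdump; its name column is assumed to hold the
#     original NCBI TaxId
# $2: original NCBI names.dmp
# Prints names.dmp-style lines pairing each new TaxId with the NCBI
# scientific name of the old TaxId it stands for.
map_names() {
  awk -F '\t[|]\t' '
    # First file read (NCBI names.dmp): remember scientific names by TaxId.
    NR == FNR { if ($4 ~ /^scientific name/) sci[$1] = $2; next }
    # Second file read (new names.dmp): $1 is the new TaxId, $2 the old one.
    ($2 in sci) { printf "%s\t|\t%s\t|\t\t|\tscientific name\t|\n", $1, sci[$2] }
  ' "$2" "$1"
}
```

It might be called as "map_names fungi/names.dmp /path/to/ncbi/names.dmp > names.mapped.dmp", where both the NCBI path and the output file name are placeholders. New TaxIds whose old TaxId has no scientific name in the NCBI file are silently skipped, so comparing line counts afterwards would be a sensible check.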
Prerequisites
taxonkit v0.12.0
Describe your issue
I have a few issues with using taxonkit create-taxdump after reformat.
When I provide two-column input consisting of tax ID and lineage, names.dmp contains the full lineage as the scientific name, and nodes.dmp contains the lineage from the first row in the rank field.
When I provide --rank-names "superkingdom,phylum,class,order,family,genus,species", I get an error on the first line. This happens whether or not I use the --trim option to reformat.
When I provide a reformatted lineage file that also contains the scientific name, rank, and tax ID lineage as columns, taxonkit create-taxdump takes much longer to create an output (I have yet to see if it completes).