shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

TODO: new command to create taxdump files for MAG genome collections #56

Closed shenwei356 closed 2 years ago

shenwei356 commented 2 years ago

It's similar to gtdb_to_taxdump, but more generalized to support MGV.

The input would be:

genome-id    species-id   kingdom  phylum   class   order  family  genus  species
---------    ----------   ------------------------------------------------------
needed       optional     optional

# At least one of the species-id and lineage is needed
shenwei356 commented 2 years ago

Should it support stable/persistent TaxIds, so that they can be tracked?

The TaxId of a genome/assembly is easily computed by hashing the genome_id to a uint32.

# GTDB
# assembly accessions are stable in NCBI
GCA_000016605.1 -> hash(GCA_000016605) -> uint32
GCF_000011005.1 -> hash(GCF_000011005) -> uint32

# MGV
MGV-GENOME-0364295  -> hash(MGV-GENOME-0364295) or hash(0364295)

# GPD
ivig_1   -> hash(ivig_1)
uvig_6   -> hash(uvig_6)

# HumGut provides id to NCBI/GTDB mapping file.
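A minimal sketch of the idea above, assuming CRC32 as a stand-in hash function (the thread does not say which hash is intended) and two hypothetical helper names, `strip_version` and `genome_taxid`:

```python
import re
import zlib


def strip_version(genome_id: str) -> str:
    """Drop a trailing '.<digits>' version suffix, e.g. GCA_000016605.1 -> GCA_000016605."""
    return re.sub(r"\.\d+$", "", genome_id)


def genome_taxid(genome_id: str) -> int:
    """Map a genome ID to a uint32. CRC32 is an assumption made for illustration only."""
    return zlib.crc32(strip_version(genome_id).encode("utf-8")) & 0xFFFFFFFF


# IDs taken from the examples above; the printed values depend on the chosen hash.
for gid in ["GCA_000016605.1", "GCF_000011005.1", "MGV-GENOME-0364295", "ivig_1", "uvig_6"]:
    print(gid, genome_taxid(gid))
```

Because the version suffix is removed before hashing, a new assembly version keeps the same TaxId in this sketch.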
shenwei356 commented 2 years ago

For taxa at the species rank or above, however, hierarchical lineage information is needed to make TaxIds unique and stable.

GTDB

$ zcat bac120_taxonomy_r202.tsv.gz \
    | csvtk cut -Ht -f 2 \
    | csvtk uniq -Ht \
    | head -n 5
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
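One way to use the hierarchy, sketched under assumptions: each taxon's TaxId is derived from the lineage path down to that taxon, so taxa with the same name under different parents get different keys. The function names and the CRC32 hash are hypothetical choices for illustration, not necessarily what taxonkit does.

```python
import zlib

# GTDB rank prefixes as they appear in the lineages above.
RANK_PREFIXES = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
                 "f": "family", "g": "genus", "s": "species"}


def parse_gtdb_lineage(lineage):
    """Split 'd__X;p__Y;...' into a list of (rank, name) pairs."""
    pairs = []
    for item in lineage.split(";"):
        prefix, _, name = item.partition("__")
        pairs.append((RANK_PREFIXES[prefix], name))
    return pairs


def lineage_taxids(lineage):
    """Return (rank, name, taxid) per taxon, hashing the path from the root down to it."""
    result, path = [], []
    for rank, name in parse_gtdb_lineage(lineage):
        path.append(f"{rank}:{name}")
        taxid = zlib.crc32(";".join(path).encode("utf-8")) & 0xFFFFFFFF
        result.append((rank, name, taxid))
    return result


a = lineage_taxids("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
                   "f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri")
b = lineage_taxids("d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;"
                   "f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica")
print(a[:5] == b[:5])  # the two lineages share every taxon down to the family
```

The trade-off in this sketch is that hashing the whole path changes a taxon's TaxId whenever one of its ancestors is renamed.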

MGV

Things are a little complicated.

We assigned viruses to families from the ICTV database based on alignments to genomes from NCBI GenBank and crAss-like viruses from recent studies [34,45,46] (Fig. 1d). Only 56.6% of viruses could be annotated at the family level, confirming a large knowledge gap in the taxonomy of human gut viruses.

$ csvtk cut -t -F -f 'ictv_*' mgv_contig_info.tsv \
    | csvtk uniq -t -F -f 'ictv_*' \
    | head -n 6 \
    | csvtk pretty -t
ictv_order     ictv_family    ictv_genus
------------   ------------   ----------
Caudovirales   crAss-phage    NULL
Caudovirales   Siphoviridae   NULL
Caudovirales   NULL           NULL
Caudovirales   Myoviridae     NULL
Caudovirales   Myoviridae     Lilyvirus

Note that crAss-phage is not found in ICTV.

shenwei356 commented 2 years ago

Note that some species do not have a complete lineage; e.g., GCA_018897955.1 only has Kingdom, Phylum, and Species.

GB_GCA_018897955.1      d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

GCA_016192455.1 does not even have a Phylum.

GB_GCA_016192455.1      d__Bacteria;p__JACPUC01;c__JACPUC01;o__JACPUC01;f__JACPUC01;g__JACPUC01;s__JACPUC01 sp016192455

There are also two special cases where the Class and Genus have the same name, B47-G6, while the Order and Family between them have different names.

https://gtdb.ecogenomic.org/genome?gid=GCA_003663585.1
GB_GCA_003663585.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

https://gtdb.ecogenomic.org/genome?gid=GCA_003663565.1
GB_GCA_003663565.1      d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663565
shenwei356 commented 2 years ago

Here's the alpha version:

Currently, merged.dmp and delnodes.dmp are empty. They need to be computed by comparing two versions of the GTDB taxdump. I'll work on it tomorrow.

gtdb.r207.taxdump.zip

Usage

```text
$ taxonkit create-taxdump -h
Create NCBI-style taxdump files for custom taxonomy, e.g., GTDB

Input format:
  0. For GTDB taxonomy file, just use --gtdb
  1. The input file should be tab-delimited
  2. At least one column is needed, please specify the filed index:
     1) Kingdom/Superkingdom/Domain,     -K/--field-kingdom
     2) Phylum,                          -P/--field-phylum
     3) Class,                           -C/--field-class
     4) Order,                           -O/--field-order
     5) Family,                          -F/--field-family
     6) Genus,                           -G/--field-genus
     7) Species (needed),                -S/--field-species
     8) Subspecies,                      -T/--field-subspecies
        For GTDB, we use the assembly accession (without version number).

Attentions:
  1. Names should be distinct in taxa of different rank.
     But for these missing some taxon nodes, using names of parent nodes is allowed:

       GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155

     It can also detect duplicate names with different ranks, e.g.,
     The Class and Genus have the same name B47-G6, and the Order and Family
     between them have different names. In this case, we reassign TaxId
     by increasing the TaxId until it being distinct.

       GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585

Usage:
  taxonkit create-taxdump [flags]

Flags:
  -C, --field-class int         field index of class
  -F, --field-family int        field index of family
  -G, --field-genus int         field index of genus
  -K, --field-kingdom int       field index of kingdom
  -O, --field-order int         field index of order
  -P, --field-phylum int        field index of phylum
  -S, --field-species int       field index of species (needed)
  -T, --field-subspecies int    field index of subspecies
      --force                   overwrite existed output directory
      --gtdb                    input files are GTDB taxonomy file
      --gtdb-re-subs string     regular expression to extract accession as the subspecies from the assembly ID (default "^\\w\\w_(.+)\\.\\d+$")
  -h, --help                    help for create-taxdump
      --line-chunk-size int     number of lines to process for each thread, and 4 threads is fast enough. (default 5000)
      --null strings            null value of taxa (default [,NULL,NA])
      --out-dir string          output directory
      --rank-names strings      names of the 8 ranks, order maters (default [superkingdom,phylum,class,order,family,genus,species,no rank])
```
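The "increase the TaxId until it is distinct" rule from the usage text can be sketched as follows. This is an illustration only: it assumes the TaxId is a CRC32 of the taxon name alone, and `assign_taxids` is a hypothetical helper, not taxonkit's code.

```python
import zlib


def assign_taxids(taxa):
    """Assign a distinct uint32 TaxId to each (rank, name) pair.

    The same name at two ranks hashes to the same value, so the later one
    is bumped by one until an unused TaxId is found.
    """
    assigned = {}
    used = set()
    for rank, name in taxa:
        if (rank, name) in assigned:
            continue  # already seen this taxon
        taxid = zlib.crc32(name.encode("utf-8")) & 0xFFFFFFFF
        while taxid in used:
            taxid = (taxid + 1) & 0xFFFFFFFF  # stay within uint32
        used.add(taxid)
        assigned[(rank, name)] = taxid
    return assigned


# The B47-G6 case: Class and Genus share a name, Order and Family do not.
taxa = [("class", "B47-G6"), ("order", "B47-G6B"), ("family", "47-G6"), ("genus", "B47-G6")]
ids = assign_taxids(taxa)
print(ids[("class", "B47-G6")] != ids[("genus", "B47-G6")])  # True
```

The result depends on input order: whichever taxon is seen first keeps the unmodified hash.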

Try it

$ echo Escherichia coli | taxonkit name2taxid --data-dir gtdb.r207.taxdump 
Escherichia coli        4093283224

$ echo 4093283224 | taxonkit lineage --data-dir gtdb.r207.taxdump -r 
4093283224      Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli    species

$ taxonkit list --data-dir gtdb.r207.taxdump --ids 4093283224 -r -n | head -n 5
4093283224 [species] Escherichia coli
  61542 [no rank] GCA_000176655
  167432 [no rank] GCF_002458255
  477667 [no rank] GCA_000193895
  502475 [no rank] GCF_910592735

Taxid changelog

Though merged.dmp and delnodes.dmp are not available right now, we can still generate taxid-changelog:

$ taxonkit taxid-changelog -i . -o gtdb-taxid-changelog.csv.gz --verbose

Let's look at an Escherichia flexneri assembly, which should be merged into the Escherichia coli species (https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r07-rs207/264).

$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep  -f taxid -p 167432
167432,gtdb.r202.taxdump,NEW,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;3912920909;167432
167432,gtdb.r207.taxdump,CHANGE_LIN_TAX,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;4093283224;167432

So here, 3912920909 (Escherichia flexneri) is merged into 4093283224 (Escherichia coli).
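The comparison can be made explicit by diffing the lineage-TaxId columns of the two changelog rows. A small sketch, where `changed_ancestors` is a hypothetical helper and both lineages are assumed to have the same number of ranks:

```python
def changed_ancestors(old_taxids, new_taxids):
    """Return (position, old, new) for each lineage position whose TaxId differs."""
    old = old_taxids.split(";")
    new = new_taxids.split(";")
    return [(i, int(a), int(b)) for i, (a, b) in enumerate(zip(old, new)) if a != b]


# Lineage TaxIds of GCF_002458255 (TaxId 167432) in the two releases shown above.
r202 = "609216830;3788559933;329474883;3160438580;2234733759;3334977531;3912920909;167432"
r207 = "609216830;3788559933;329474883;3160438580;2234733759;3334977531;4093283224;167432"

print(changed_ancestors(r202, r207))  # [(6, 3912920909, 4093283224)]
```

Position 6 is the species rank, so only the species-level parent of the assembly changed between r202 and r207.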

shenwei356 commented 2 years ago

I think I did it. Please visit: https://github.com/shenwei356/gtdb-taxdump

Please try the new version: https://github.com/shenwei356/taxonkit/releases/tag/v0.11.0-alpha

jolespin commented 7 months ago

Is it possible to create a "custom" taxdump where the taxids are strings and the taxonomy info includes just class, order, family, genus, and species?

shenwei356 commented 7 months ago

taxonomy info includes just class, order, family, genus, and species

You can define whatever rank you want.

taxdump where the taxids are strings

Those would not be taxdump files. It sounds like a GTDB taxonomy file? Maybe you could replace the "taxid" with something else?

jolespin commented 7 months ago

I've combined a bunch of different databases together but some do not have ncbi_taxid fields. Here's an example of what my table looks like:

id_source   dataset ncbi_taxid  lineage class   order   family  genus   species strain  notes   resolved_lineage
Aalte1  MycoCosm    5599    d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata    Dothideomycetes Pleosporales    Pleosporaceae   Alternaria  Alternaria alternata            True
Aaoar1  MycoCosm    1450171 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii   DothideomycetesPleosporales     Aaosphaeria Aaosphaeria arxii           False
Abobi1  MycoCosm    137743  d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis  Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis         True
Abobie1 MycoCosm    137743  d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis  Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis         True
Abscae1 MycoCosm    90261   d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia caerulea            True
Absrep1 MycoCosm    90262   d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens   Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia repens          True
Acain1  MycoCosm    215250  d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii   Exobasidiomycetes   Exobasidiales   Cryptobasidiaceae   Acaromyces  Acaromyces ingoldii         True
Acastr1 MycoCosm    1307806 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata   Lecanoromycetes Acarosporales   Acarosporaceae  Acarospora  Acarospora strigata         True
Acema1  MycoCosm    886606  d__Eukaryota;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Mollisiaceae;g__Acephala;s__Acephala macrosclerotiorum Leotiomycetes   Helotiales  Mollisiaceae    Acephala    Acephala macrosclerotiorum          True

Unfortunately, some have missing fields for different taxonomic levels.

shenwei356 commented 7 months ago

some do not have ncbi_taxid fields

It's not a problem.

Unfortunately, some have missing fields for different taxonomic levels.

It's OK. See the last example.

$ cat data.tsv \
    | csvtk fix -t \
    | csvtk cut -t -f class,order,family,genus,species \
    | csvtk pretty -t

class                         order           family               genus               species                   
---------------------------   -------------   ------------------   -----------------   --------------------------
Dothideomycetes               Pleosporales    Pleosporaceae        Alternaria          Alternaria alternata      
DothideomycetesPleosporales                   Aaosphaeria          Aaosphaeria arxii                             
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis       
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis       
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia caerulea          
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia repens            
Exobasidiomycetes             Exobasidiales   Cryptobasidiaceae    Acaromyces          Acaromyces ingoldii       
Lecanoromycetes               Acarosporales   Acarosporaceae       Acarospora          Acarospora strigata       
Leotiomycetes                 Helotiales      Mollisiaceae         Acephala            Acephala macrosclerotiorum

$ cat data.tsv \
    | csvtk fix -t \
    | csvtk cut -t -f species \
    | taxonkit --data-dir test name2taxid \
    | taxonkit --data-dir test lineage -r -i 2 \
    | csvtk rename -t -f 1-4 -n query,taxid,lineage,rank \
    | csvtk pretty -t -W 40 -x ';'

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ query                      ┃ taxid      ┃ lineage                                  ┃ rank    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Alternaria alternata       ┃ 1551135023 ┃ Dothideomycetes;Pleosporales;            ┃ species ┃
┃                            ┃            ┃ Pleosporaceae;Alternaria;                ┃         ┃
┃                            ┃            ┃ Alternaria alternata                     ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia caerulea           ┃ 2044322620 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia caerulea                         ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia repens             ┃ 2093321634 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia repens                           ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acaromyces ingoldii        ┃ 1276596547 ┃ Exobasidiomycetes;Exobasidiales;         ┃ species ┃
┃                            ┃            ┃ Cryptobasidiaceae;Acaromyces;            ┃         ┃
┃                            ┃            ┃ Acaromyces ingoldii                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acarospora strigata        ┃ 643580302  ┃ Lecanoromycetes;Acarosporales;           ┃ species ┃
┃                            ┃            ┃ Acarosporaceae;Acarospora;               ┃         ┃
┃                            ┃            ┃ Acarospora strigata                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acephala macrosclerotiorum ┃ 1403990922 ┃ Leotiomycetes;Helotiales;Mollisiaceae;   ┃ species ┃
┃                            ┃            ┃ Acephala;Acephala macrosclerotiorum      ┃         ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━┛

$ echo Aaosphaeria arxii \
    | taxonkit --data-dir test name2taxid \
    | taxonkit --data-dir test reformat -I 2 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
    | csvtk add-header -t -n query,taxid,kingdom,phylum,class,order,family,genus,species \
    | csvtk pretty -t

query               taxid        kingdom   phylum   class                         order   family        genus               species
-----------------   ----------   -------   ------   ---------------------------   -----   -----------   -----------------   -------
Aaosphaeria arxii   1933378114                      DothideomycetesPleosporales           Aaosphaeria   Aaosphaeria arxii 

Cheers! :beers:

jolespin commented 7 months ago

Am I doing this correctly?

Here is test.tsv:

gzip -d -c source_taxonomy.tsv.gz | head -n 10 > test.tsv
id_source   dataset ncbi_taxid  lineage class   order   family  genus   species strain  notes   resolved_lineage
Aalte1  MycoCosm    5599    d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata    Dothideomycetes Pleosporales    Pleosporaceae   Alternaria  Alternaria alternata            True
Aaoar1  MycoCosm    1450171 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii   Dothideomycetes Pleosporales        Aaosphaeria Aaosphaeria arxii           False
Abobi1  MycoCosm    137743  d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis  Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis         True
Abobie1 MycoCosm    137743  d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis  Agaricomycetes  Polyporales Podoscyphaceae  Abortiporus Abortiporus biennis         True
Abscae1 MycoCosm    90261   d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia caerulea            True
Absrep1 MycoCosm    90262   d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens   Mucoromycetes   Mucorales   Cunninghamellaceae  Absidia Absidia repens          True
Acain1  MycoCosm    215250  d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii   Exobasidiomycetes   Exobasidiales   Cryptobasidiaceae   Acaromyces  Acaromyces ingoldii     True
acanthamoeba_castellanii_str_neff_gca_000313135 EnsemblProtists 1257118 d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii      Longamoebia Acanthamoebidae Acanthamoeba    Acanthamoeba castellanii    Acanthamoeba castellanii str. Neff      False
Acastr1 MycoCosm    1307806 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata   Lecanoromycetes Acarosporales   Acarosporaceae  Acarospora  Acarospora strigata         True

Now remove header, get source name and lineage, then pipe into taxonkit create-taxdump:

Aalte1  d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata
Aaoar1  d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii
Abobi1  d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abobie1 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abscae1 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea
Absrep1 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens
Acain1  d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii
acanthamoeba_castellanii_str_neff_gca_000313135 d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii
Acastr1 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata

Piping the above into taxonkit:

cat test.tsv | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O test_output/
17:37:32.975 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the origninal value is used instead. e.g., Acastr1
17:37:32.978 [INFO] 9 records saved to test_output/taxid.map
17:37:32.978 [INFO] 47 records saved to test_output/nodes.dmp
17:37:32.979 [INFO] 47 records saved to test_output/names.dmp
17:37:32.979 [INFO] 0 records saved to test_output/merged.dmp
17:37:32.979 [INFO] 0 records saved to test_output/delnodes.dmp

Is this the correct usage?

When I try it with the full dataset, I get this error:

gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O taxdump/
17:38:53.517 [ERRO] invalid GTDB taxonomy record: MicG_I_3
shenwei356 commented 7 months ago

Well, it becomes a little bit complex.

It seems that the line containing MicG_I_3 is not in the format d__xx;p__xxx. Is it from another source?
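To locate such rows before running create-taxdump, the two-column input (ID, lineage) can be pre-checked. The sketch below assumes a well-formed record has exactly the seven prefixed ranks d__ to s__, with empty names allowed; taxonkit's own validation may differ. The MicG_I_3 row with an empty lineage is a made-up stand-in, since the real row is not shown in the thread.

```python
import re

# Seven GTDB-style ranks in order; a rank name may be empty (e.g. 'f__').
GTDB_LINEAGE = re.compile(
    r"^d__[^;]*;p__[^;]*;c__[^;]*;o__[^;]*;f__[^;]*;g__[^;]*;s__[^;]*$"
)


def malformed_records(lines):
    """Yield (line_number, id) for rows whose second column is not a 7-rank lineage."""
    for n, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        lineage = fields[1] if len(fields) > 1 else ""
        if not GTDB_LINEAGE.match(lineage):
            yield n, fields[0]


rows = [
    "Aalte1\td__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__Pleosporaceae;g__Alternaria;s__Alternaria alternata\n",
    "Aaoar1\td__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;"
    "f__;g__Aaosphaeria;s__Aaosphaeria arxii\n",
    "MicG_I_3\t\n",  # hypothetical: lineage column left empty
]
print(list(malformed_records(rows)))  # [(3, 'MicG_I_3')]
```

On the real data, the same function could be fed the lines of a file holding the `cut -f1,4` output.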

At first glance, I thought the input data had complete information for all ranks. But it seems that domain and phylum are missing.

$ csvtk headers -t t.tsv
id_source
dataset
ncbi_taxid
lineage
class
order
family
genus
species
strain
notes
resolved_lineage

For rows with a lineage in the format "d__xxx;p__xxx", the domain and phylum can be extracted, so you can use the code I pasted below. If the line for MicG_I_3 does not have enough lineage information, that's a problem.

jolespin commented 7 months ago

Damn, looks like that one is missing a lineage field.

What would the command be if I had the following columns:

id_source, class, order, family, genus, species, strain (some fields might be empty here but not all fields)

gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,5,6,7,8,9,10 | taxonkit create-taxdump

Would -A 0 and -R be the next args?

shenwei356 commented 7 months ago
zcat source_taxonomy.tsv.gz \
    | csvtk fix -t \
    | csvtk cut -t -f id_source,class,order,family,genus,species,strain \
    | taxonkit create-taxdump -A 1 -O taxdump
jolespin commented 7 months ago

Do I need to specify to create-taxdump what the class, order, family, genus, species, and strain columns are or does it autodetect?

shenwei356 commented 7 months ago

https://bioinf.shenwei.me/taxonkit/usage/#create-taxdump

Please check the usage and example 4.

Input formats:

  2. Ranks can be given either via the first row or the flag --rank-names.

Flags:

  -R, --rank-names strings          names of all ranks, leave it empty to use the first row of input as
                                    rank names