Closed shenwei356 closed 2 years ago
Supporting stable/persistent TaxIds, so that they can be tracked?
The TaxId of a genome/assembly is easily computed by hashing the genome_id to a uint32.
# GTDB
# assembly accessions are stable in NCBI
GCA_000016605.1 -> hash(GCA_000016605) -> uint32
GCF_000011005.1 -> hash(GCF_000011005) -> uint32
# MGV
MGV-GENOME-0364295 -> hash(MGV-GENOME-0364295) or hash(0364295)
# GPD
ivig_1 -> hash(ivig_1)
uvig_6 -> hash(uvig_6)
# HumGut provides id to NCBI/GTDB mapping file.
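A minimal sketch of the hashing idea, assuming any deterministic 32-bit hash will do. The thread does not say which hash function is used, so POSIX cksum (CRC-32) is only a stand-in here, and hash_id is a hypothetical helper name. The version suffix is stripped first, so every version of an assembly maps to the same TaxId.

```shell
# hash_id: map a genome/assembly ID to a uint32 (hypothetical helper).
# cksum's CRC-32 is a stand-in; it is not necessarily the hash TaxonKit uses.
hash_id() {
    id=${1%.[0-9]*}                            # GCA_000016605.1 -> GCA_000016605
    printf '%s' "$id" | cksum | cut -d ' ' -f 1
}

# Print "ID<TAB>TaxId" for a few of the IDs listed above.
for acc in GCA_000016605.1 GCF_000011005.1 MGV-GENOME-0364295 ivig_1; do
    printf '%s\t%s\n' "$acc" "$(hash_id "$acc")"
done
```

With a 32-bit hash, collisions between distinct IDs are possible in large collections, so a real implementation would need to detect them.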
For taxa at the species rank or above, however, hierarchical lineage information is needed to make the TaxIds unique and stable.
$ zcat bac120_taxonomy_r202.tsv.gz \
| csvtk cut -Ht -f 2 \
| csvtk uniq -Ht \
| head -n 5
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;s__Escherichia flexneri
d__Bacteria;p__Firmicutes;c__Bacilli;o__Staphylococcales;f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Salmonella;s__Salmonella enterica
d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Enterobacterales;f__Enterobacteriaceae;g__Klebsiella;s__Klebsiella pneumoniae
d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Streptococcaceae;g__Streptococcus;s__Streptococcus pneumoniae
Things are a little complicated.
We assigned viruses to families from the ICTV database based on alignments to genomes from NCBI GenBank and crAss-like viruses from recent studies [34,45,46] (Fig. 1d). Only 56.6% of viruses could be annotated at the family level, confirming a large knowledge gap in the taxonomy of human gut viruses.
$ csvtk cut -t -F -f 'ictv_*' mgv_contig_info.tsv \
| csvtk uniq -t -F -f 'ictv_*' \
| head -n 6 \
| csvtk pretty -t
ictv_order     ictv_family    ictv_genus
------------   ------------   ----------
Caudovirales   crAss-phage    NULL
Caudovirales   Siphoviridae   NULL
Caudovirales   NULL           NULL
Caudovirales   Myoviridae     NULL
Caudovirales   Myoviridae     Lilyvirus
Note that crAss-phage is not found in ICTV.
Note that some species do not have a complete lineage; e.g., GCA_018897955.1 only has Kingdom, Phylum, and Species.
GB_GCA_018897955.1 d__Archaea;p__EX4484-52;c__EX4484-52;o__EX4484-52;f__LFW-46;g__LFW-46;s__LFW-46 sp018897155
And GCA_016192455.1 does not even have a Phylum.
GB_GCA_016192455.1 d__Bacteria;p__JACPUC01;c__JACPUC01;o__JACPUC01;f__JACPUC01;g__JACPUC01;s__JACPUC01 sp016192455
Two special cases where the Class and Genus have the same name, B47-G6, while the Order and Family between them have different names.
https://gtdb.ecogenomic.org/genome?gid=GCA_003663585.1
GB_GCA_003663585.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585
https://gtdb.ecogenomic.org/genome?gid=GCA_003663565.1
GB_GCA_003663565.1 d__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663565
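One way to keep such names apart is to hash the lineage prefix that ends at each rank rather than the bare name. The sketch below illustrates that idea only; it is not taken from the tool's source. The function name lineage_taxids is hypothetical and cksum (CRC-32) is again a stand-in hash.

```shell
# lineage_taxids: print "taxid<TAB>name" for every rank of one GTDB lineage.
# Each TaxId is the hash of the lineage prefix ending at that rank, so the
# same name at two ranks (or under two parents) gets two different TaxIds.
# cksum (CRC-32) is a stand-in hash, not necessarily the one TaxonKit uses.
lineage_taxids() {
    printf '%s\n' "$1" \
    | awk -F ';' '{
        prefix = $1; print prefix
        for (i = 2; i <= NF; i++) { prefix = prefix ";" $i; print prefix }
    }' \
    | while IFS= read -r prefix; do
        taxid=$(printf '%s' "$prefix" | cksum | cut -d ' ' -f 1)
        name=${prefix##*;}              # last element, e.g. g__B47-G6
        printf '%s\t%s\n' "$taxid" "${name#?__}"
    done
}

lineage_taxids 'd__Archaea;p__Thermoplasmatota;c__B47-G6;o__B47-G6B;f__47-G6;g__B47-G6;s__B47-G6 sp003663585'
```

The third and sixth output lines both carry the name B47-G6 but are hashed from different prefixes, so the class and the genus remain separate nodes.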
Here's the alpha version:
Currently, merged.dmp and delnodes.dmp are empty. They need to be computed by comparing two versions of the GTDB taxdump. I'll work on it tomorrow.
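A rough sketch of the first step of such a comparison, assuming NCBI-style nodes.dmp files whose first tab-separated field is the TaxId. This is not TaxonKit's implementation; removed_taxids is a hypothetical helper, and the paths in the usage comment simply reuse the directory names from this thread.

```shell
# removed_taxids OLD_NODES NEW_NODES
# Print TaxIds found in the old nodes.dmp but not in the new one. Each is a
# candidate for merged.dmp or delnodes.dmp; telling the two apart needs the
# genome-to-taxon assignments of both releases, which this sketch ignores.
removed_taxids() {
    old_ids=$(mktemp) && new_ids=$(mktemp) || return 1
    cut -f 1 "$1" | sort -u > "$old_ids"
    cut -f 1 "$2" | sort -u > "$new_ids"
    comm -23 "$old_ids" "$new_ids"      # lines only in the old list
    rm -f "$old_ids" "$new_ids"
}

# Hypothetical usage:
# removed_taxids gtdb.r202.taxdump/nodes.dmp gtdb.r207.taxdump/nodes.dmp
```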
$ echo Escherichia coli | taxonkit name2taxid --data-dir gtdb.r207.taxdump
Escherichia coli 4093283224
$ echo 4093283224 | taxonkit lineage --data-dir gtdb.r207.taxdump -r
4093283224 Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli species
$ taxonkit list --data-dir gtdb.r207.taxdump --ids 4093283224 -r -n | head -n 5
4093283224 [species] Escherichia coli
61542 [no rank] GCA_000176655
167432 [no rank] GCF_002458255
477667 [no rank] GCA_000193895
502475 [no rank] GCF_910592735
Though merged.dmp and delnodes.dmp are not available right now, we can still generate the taxid-changelog:
$ taxonkit taxid-changelog -i . -o gtdb-taxid-changelog.csv.gz --verbose
Let's look at an Escherichia flexneri assembly, which should be merged into the Escherichia coli species (https://forum.gtdb.ecogenomic.org/t/announcing-gtdb-r07-rs207/264).
$ zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 167432
167432,gtdb.r202.taxdump,NEW,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia flexneri;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;3912920909;167432
167432,gtdb.r207.taxdump,CHANGE_LIN_TAX,,GCF_002458255,no rank,Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia;Escherichia coli;GCF_002458255,609216830;3788559933;329474883;3160438580;2234733759;3334977531;4093283224;167432
So here, 3912920909 is merged into 4093283224.
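The same reading of the changelog can be scripted. The sketch below assumes records shaped like the two shown above: eight comma-separated columns with the lineage TaxIds last and no quoted commas inside fields. parent_by_version is a hypothetical helper name.

```shell
# parent_by_version: for changelog records on stdin, print
# "version<TAB>parent TaxId", where the parent is the second-to-last entry
# of the lineage-taxids column (the last column of each record).
# Splitting on "," assumes that no field contains a quoted comma.
parent_by_version() {
    awk -F ',' '{
        n = split($NF, ids, ";")
        if (n >= 2) print $2 "\t" ids[n - 1]   # skips a header line, if present
    }'
}

# Hypothetical usage:
# zcat gtdb-taxid-changelog.csv.gz | csvtk grep -f taxid -p 167432 | parent_by_version
```

A parent TaxId that differs between two versions of the same assembly, as in the records above, means the assembly moved to another species.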
I think I did it. Please visit: https://github.com/shenwei356/gtdb-taxdump
Please try the new version: https://github.com/shenwei356/taxonkit/releases/tag/v0.11.0-alpha
Is it possible to create a "custom" taxdump where the taxids are strings and the taxonomy info includes just class, order, family, genus, and species?
> taxonomy info includes just class, order, family, genus, and species
You can define whatever rank you want.
> taxdump where the taxids are strings
Those would not be taxdump files. It sounds more like a GTDB taxonomy file. Maybe you could replace the "taxid" with something else?
I've combined a bunch of different databases together, but some do not have ncbi_taxid fields. Here's an example of what my table looks like:
id_source dataset ncbi_taxid lineage class order family genus species strain notes resolved_lineage
Aalte1 MycoCosm 5599 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata Dothideomycetes Pleosporales Pleosporaceae Alternaria Alternaria alternata True
Aaoar1 MycoCosm 1450171 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii DothideomycetesPleosporales Aaosphaeria Aaosphaeria arxii False
Abobi1 MycoCosm 137743 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis True
Abobie1 MycoCosm 137743 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis True
Abscae1 MycoCosm 90261 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia caerulea True
Absrep1 MycoCosm 90262 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia repens True
Acain1 MycoCosm 215250 d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii Exobasidiomycetes Exobasidiales Cryptobasidiaceae Acaromyces Acaromyces ingoldii True
Acastr1 MycoCosm 1307806 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata Lecanoromycetes Acarosporales Acarosporaceae Acarospora Acarospora strigata True
Acema1 MycoCosm 886606 d__Eukaryota;p__Ascomycota;c__Leotiomycetes;o__Helotiales;f__Mollisiaceae;g__Acephala;s__Acephala macrosclerotiorum Leotiomycetes Helotiales Mollisiaceae Acephala Acephala macrosclerotiorum True
Unfortunately, some have missing fields for different taxonomic levels.
> some do not have ncbi_taxid fields
It's not a problem.
> Unfortunately, some have missing fields for different taxonomic levels.
It's OK. See the last example.
$ cat data.tsv \
| csvtk fix -t \
| csvtk cut -t -f class,order,family,genus,species \
| csvtk pretty -t
class                         order           family               genus               species
---------------------------   -------------   ------------------   -----------------   --------------------------
Dothideomycetes               Pleosporales    Pleosporaceae        Alternaria          Alternaria alternata
DothideomycetesPleosporales                   Aaosphaeria          Aaosphaeria arxii
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis
Agaricomycetes                Polyporales     Podoscyphaceae       Abortiporus         Abortiporus biennis
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia caerulea
Mucoromycetes                 Mucorales       Cunninghamellaceae   Absidia             Absidia repens
Exobasidiomycetes             Exobasidiales   Cryptobasidiaceae    Acaromyces          Acaromyces ingoldii
Lecanoromycetes               Acarosporales   Acarosporaceae       Acarospora          Acarospora strigata
Leotiomycetes                 Helotiales      Mollisiaceae         Acephala            Acephala macrosclerotiorum
$ cat data.tsv \
| csvtk fix -t \
| csvtk cut -t -f species \
| taxonkit --data-dir test name2taxid \
| taxonkit --data-dir test lineage -r -i 2 \
| csvtk rename -t -f 1-4 -n query,taxid,lineage,rank \
| csvtk pretty -t -W 40 -x ';'
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┓
┃ query                      ┃ taxid      ┃ lineage                                  ┃ rank    ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Alternaria alternata       ┃ 1551135023 ┃ Dothideomycetes;Pleosporales;            ┃ species ┃
┃                            ┃            ┃ Pleosporaceae;Alternaria;                ┃         ┃
┃                            ┃            ┃ Alternaria alternata                     ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Abortiporus biennis        ┃ 1095247865 ┃ Agaricomycetes;Polyporales;              ┃ species ┃
┃                            ┃            ┃ Podoscyphaceae;Abortiporus;              ┃         ┃
┃                            ┃            ┃ Abortiporus biennis                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia caerulea           ┃ 2044322620 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia caerulea                         ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Absidia repens             ┃ 2093321634 ┃ Mucoromycetes;Mucorales;                 ┃ species ┃
┃                            ┃            ┃ Cunninghamellaceae;Absidia;              ┃         ┃
┃                            ┃            ┃ Absidia repens                           ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acaromyces ingoldii        ┃ 1276596547 ┃ Exobasidiomycetes;Exobasidiales;         ┃ species ┃
┃                            ┃            ┃ Cryptobasidiaceae;Acaromyces;            ┃         ┃
┃                            ┃            ┃ Acaromyces ingoldii                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acarospora strigata        ┃ 643580302  ┃ Lecanoromycetes;Acarosporales;           ┃ species ┃
┃                            ┃            ┃ Acarosporaceae;Acarospora;               ┃         ┃
┃                            ┃            ┃ Acarospora strigata                      ┃         ┃
┣━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━━━━╋━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╋━━━━━━━━━┫
┃ Acephala macrosclerotiorum ┃ 1403990922 ┃ Leotiomycetes;Helotiales;Mollisiaceae;   ┃ species ┃
┃                            ┃            ┃ Acephala;Acephala macrosclerotiorum      ┃         ┃
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━━━━┻━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┻━━━━━━━━━┛
$ echo Aaosphaeria arxii \
| taxonkit --data-dir test name2taxid \
| taxonkit --data-dir test reformat -I 2 -f '{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}' \
| csvtk add-header -t -n query,taxid,kingdom,phylum,class,order,family,genus,species \
| csvtk pretty -t
query               taxid        kingdom   phylum   class                         order   family        genus               species
-----------------   ----------   -------   ------   ---------------------------   -----   -----------   -----------------   -------
Aaosphaeria arxii   1933378114                      DothideomycetesPleosporales           Aaosphaeria   Aaosphaeria arxii
Cheers! :beers:
Am I doing this correctly?
Here is test.tsv:
gzip -d -c source_taxonomy.tsv.gz | head -n 10 > test.tsv
id_source dataset ncbi_taxid lineage class order family genus species strain notes resolved_lineage
Aalte1 MycoCosm 5599 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata Dothideomycetes Pleosporales Pleosporaceae Alternaria Alternaria alternata True
Aaoar1 MycoCosm 1450171 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii Dothideomycetes Pleosporales Aaosphaeria Aaosphaeria arxii False
Abobi1 MycoCosm 137743 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis True
Abobie1 MycoCosm 137743 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis Agaricomycetes Polyporales Podoscyphaceae Abortiporus Abortiporus biennis True
Abscae1 MycoCosm 90261 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia caerulea True
Absrep1 MycoCosm 90262 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens Mucoromycetes Mucorales Cunninghamellaceae Absidia Absidia repens True
Acain1 MycoCosm 215250 d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii Exobasidiomycetes Exobasidiales Cryptobasidiaceae Acaromyces Acaromyces ingoldii True
acanthamoeba_castellanii_str_neff_gca_000313135 EnsemblProtists 1257118 d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii Longamoebia Acanthamoebidae Acanthamoeba Acanthamoeba castellanii Acanthamoeba castellanii str. Neff False
Acastr1 MycoCosm 1307806 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata Lecanoromycetes Acarosporales Acarosporaceae Acarospora Acarospora strigata True
Now remove the header, keep the source ID and lineage columns, and pipe the result into taxonkit create-taxdump:
Aalte1 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__Pleosporaceae;g__Alternaria;s__Alternaria alternata
Aaoar1 d__Eukaryota;p__Ascomycota;c__Dothideomycetes;o__Pleosporales;f__;g__Aaosphaeria;s__Aaosphaeria arxii
Abobi1 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abobie1 d__Eukaryota;p__Basidiomycota;c__Agaricomycetes;o__Polyporales;f__Podoscyphaceae;g__Abortiporus;s__Abortiporus biennis
Abscae1 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia caerulea
Absrep1 d__Eukaryota;p__Mucoromycota;c__Mucoromycetes;o__Mucorales;f__Cunninghamellaceae;g__Absidia;s__Absidia repens
Acain1 d__Eukaryota;p__Basidiomycota;c__Exobasidiomycetes;o__Exobasidiales;f__Cryptobasidiaceae;g__Acaromyces;s__Acaromyces ingoldii
acanthamoeba_castellanii_str_neff_gca_000313135 d__Eukaryota;p__Discosea;c__;o__Longamoebia;f__Acanthamoebidae;g__Acanthamoeba;s__Acanthamoeba castellanii
Acastr1 d__Eukaryota;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora strigata
Piping the above into taxonkit:
cat test.tsv | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O test_output/
17:37:32.975 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the origninal value is used instead. e.g., Acastr1
17:37:32.978 [INFO] 9 records saved to test_output/taxid.map
17:37:32.978 [INFO] 47 records saved to test_output/nodes.dmp
17:37:32.979 [INFO] 47 records saved to test_output/names.dmp
17:37:32.979 [INFO] 0 records saved to test_output/merged.dmp
17:37:32.979 [INFO] 0 records saved to test_output/delnodes.dmp
Is this the correct usage?
When I try it with the full dataset, I get this error:
gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,4 | taxonkit create-taxdump --gtdb -O taxdump/
17:38:53.517 [ERRO] invalid GTDB taxonomy record: MicG_I_3
Well, it becomes a little bit complex.
It seems that the line containing MicG_I_3 is not in the format of d__xx;p__xxx. Is it from another source?
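Records like this can be found before running create-taxdump. The sketch below assumes two-column input (ID, then lineage, tab-separated) and treats the full seven-rank d__ to s__ form as valid. That pattern is an assumption, not the check TaxonKit itself performs, and invalid_gtdb_records is a hypothetical helper name.

```shell
# invalid_gtdb_records: read "id<TAB>lineage" lines on stdin and print the IDs
# whose lineage is not of the seven-rank form d__..;p__..;c__..;o__..;f__..;g__..;s__..
# The pattern is an assumption about what counts as valid; the check inside
# taxonkit create-taxdump --gtdb may differ. Empty names such as "f__" pass.
invalid_gtdb_records() {
    awk -F '\t' '
        $2 !~ /^d__[^;]*;p__[^;]*;c__[^;]*;o__[^;]*;f__[^;]*;g__[^;]*;s__[^;]*$/ { print $1 }
    '
}

# Hypothetical usage with the file from this thread:
# gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,4 | invalid_gtdb_records
```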
At first glance, I thought the input data had complete information for all ranks, but it seems that the domain and phylum columns are missing.
$ csvtk headers -t t.tsv
id_source
dataset
ncbi_taxid
lineage
class
order
family
genus
species
strain
notes
resolved_lineage
For rows whose lineage is in the "d__xxx;p__xxx" format, domain and phylum can be extracted, so you can use the code I pasted below. If the MicG_I_3 line does not have enough lineage information, that's a problem.
Damn, looks like that one is missing a lineage field.
What would the command be if I had the following columns:
id_source, class, order, family, genus, species, strain (some fields might be empty here but not all fields)
gzip -d -c source_taxonomy.tsv.gz | tail -n +2 | cut -f1,5,6,7,8,9,10 | taxonkit create-taxdump
Would -A 0 and -R be the next args?
zcat source_taxonomy.tsv.gz \
| csvtk fix -t \
| csvtk cut -t -f id_source,class,order,family,genus,species,strain \
| taxonkit create-taxdump -A 1 -O taxdump
Do I need to tell create-taxdump which columns are the class, order, family, genus, species, and strain, or does it autodetect them?
https://bioinf.shenwei.me/taxonkit/usage/#create-taxdump
Please check the usage and example 4.
Input formats:
2. Ranks can be given either via the first row or the flag --rank-names.
Flags:
-R, --rank-names strings names of all ranks, leave it empty to use the first row of input as
rank names
It's similar to gtdb_to_taxdump, but more generalized to support MGV.
The input would be: