Open zztin opened 1 year ago
Someone had the same need before, and I write these steps and will add them to the doc of TaxonKit.
Sometimes (1) one needs to build a database including bacteria and archaea (from GTDB) and viral database from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat, and then create taxdump files from them with taxonkit create-taxdump.
Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.
taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent "" \
| taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \
| taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \
--format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
-o gtdb.tsv
Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Since the lineage of many viruses are incomplete, we fill in missing taxa, e.g., "327830 Viruses;Polydnaviriformidae;Ichnoviriform;unclassified Ichnovirus" -> "Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Polydnaviriformidae;Ichnoviriform;unclassified Ichnoviriform species;unclassified Ichnovirus".
# taxid of Viruses: 10239
taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent "" \
| taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \
| taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \
--fill-miss-rank --pseudo-strain --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
-o ncbi-viral.tsv
Creating taxdump from lineages above.
(awk '{print $_"\t"}' gtdb.tsv; cat ncbi-viral.tsv) \
| taxonkit create-taxdump \
--field-accession 1 \
-R "superkingdom,phylum,class,order,family,genus,species,strain" \
-O taxdump
$ grep 2697049 taxdump/taxid.map # SARS-COV-2 2697049 21630522
# # SARS-COV-2 in NCBI taxonomy
$ echo 2697049 \
| taxonkit lineage -t --data-dir ~/.taxonkit \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
10239 superkingdom Viruses
2559587 clade Riboviria
2732396 kingdom Orthornavirae
2732408 phylum Pisuviricota
2732506 class Pisoniviricetes
76804 order Nidovirales
2499399 suborder Cornidovirineae
11118 family Coronaviridae
2501931 subfamily Orthocoronavirinae
694002 genus Betacoronavirus
2509511 subgenus Sarbecovirus
694009 species Severe acute respiratory syndrome-related coronavirus
2697049 no rank Severe acute respiratory syndrome coronavirus 2
$ echo "Severe acute respiratory syndrome coronavirus 2" | taxonkit name2taxid --data-dir taxdump/
Severe acute respiratory syndrome coronavirus 2 216305222
$ echo 216305222 \
| taxonkit lineage -t --data-dir taxdump/ \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir taxdump/ \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
1287770734 superkingdom Viruses
1506901452 phylum Pisuviricota
1091693597 class Pisoniviricetes
37745009 order Nidovirales
738421640 family Coronaviridae
906833049 genus Betacoronavirus
1015862491 species Severe acute respiratory syndrome-related coronavirus
216305222 strain Severe acute respiratory syndrome coronavirus 2
$ echo "Escherichia coli" | taxonkit name2taxid --data-dir taxdump/
Escherichia coli 1945799576
$ echo 1945799576 \
| taxonkit lineage -t --data-dir taxdump/ \
| csvtk cut -Ht -f 3 \
| csvtk unfold -Ht -f 1 -s ";" \
| taxonkit lineage -r -n -L --data-dir taxdump/ \
| csvtk cut -Ht -f 1,3,2 \
| csvtk pretty -Ht
609216830 superkingdom Bacteria
1641076285 phylum Proteobacteria
329474883 class Gammaproteobacteria
1012954932 order Enterobacterales
87250111 family Enterobacteriaceae
1187493883 genus Escherichia
1945799576 species Escherichia coli
Hi Shenwei,
Great documentation! Thank you for your time!
I have 2 related questions:
I did not completely describe my question. We also like to integrate human, plasmid, and UniVec_Core sequences. Since Bacteria and Achaea and Virus example above was cut at Domain(D) level, how would you suggest to integrate the upper levels as listed below? From kraken2-inspect file by grep the text, I got the following taxids:
taxid of (U) unclassified: 0
taxid of (R) root : 1
taxid of (D) Eukaryota: 2759
taxid of (R1) other entries: 2787854
taxid of (R2) other sequences: 28384
taxid of (R3) plasmids: 36549
(other entries -> other sequences -> plasmids)
All UniVec_Core sequences are assigned as other sequences. taxid = 28384
For clarity, following step3 in the tutorial if I continue to build the kraken2 database. Is the following steps sufficient? a. unzip the taxdump file and locate these files in the $DBNAME/taxonomy folder b. move all library.fna files (including GTDB downloaded from Struo2, and libraries downloaded by kraken2-build --download-library, and customized genome if existed in the $DBNAME/library folder c. build the kraken2 db by kraken2-build build --db $DBNAME
Especially for the a. part. Is the taxdump file containing all files needed for taxonomy folder? For a standard download with kraken2-build --download-taxonomy, the file list I got are:
taxdump.untarflag
taxdump.dlflag
taxdump.tar.gz
accmap.dlflag
nucl_wgs.accession2taxid
nucl_gb.accession2taxid
gc.prt
citations.dmp
names.dmp
nodes.dmp
delnodes.dmp
merged.dmp
division.dmp
gencode.dmp
readme.txt
Thank you very much. Best, Li-Ting
I did not completely describe my question. We also like to integrate human, plasmid, and UniVec_Core sequences. Since Bacteria and Achaea and Virus example above was cut at Domain(D) level, how would you suggest to integrate the upper levels as listed below?
Whatever you've got, you can follow the steps above to 1) export/create complete lineages and 2) create taxdump files from them.
For clarity, following step3 in the tutorial if I continue to build the kraken2 database. Is the following steps sufficient? a. unzip the taxdump file and locate these files in the $DBNAME/taxonomy folder b. move all library.fna files (including GTDB downloaded from Struo2, and libraries downloaded by kraken2-build --download-library, and customized genome if existed in the $DBNAME/library folder c. build the kraken2 db by kraken2-build build --db $DBNAME
Here are some steps for Custom database for Kraken and Bracken you can learn from. Just prepare the files in the required format that Kraken2 wants.
Especially for the a. part. Is the taxdump file containing all files needed for taxonomy folder? For a standard download with kraken2-build --download-taxonomy, the file list I got are:
Essential taxdump files are:
names.dmp
nodes.dmp
delnodes.dmp
merged.dmp
The accession2taxid files map accession (only these in RefSeq/Genbank) to taxid, which is optional if you format the FASTA IDs for Kraken by yourself. For the sequences from GTDB, the mapping relationship is provided by gtdb-taxdump/R207/taxid.map
.
$ head -n 1 gtdb-taxdump/R207/taxid.map
GCF_000979375.1 1349515035
For viral sequences, it requires an extra step. The relationship is accession
-> NCBI taxid
-> New TaxId created by taxonkit
.
accession -> NCBI taxid # provided by nucl_gb.accession2taxid
NCBI taxid -> New TaxId created by taxonkit # available in taxid.map generated with taxonkit with the steps above, but the order is reversed: 'New TaxId created by taxonkit' -> 'NCBI taxid'.
What I can provide is the way to generate taxdump files that combine GTDB taxonomy and NCBI taxonomy, and taxid.map file that maps accession (whatever you provide, you can use the sequence accession) to TaxIds.
Good luck!
Hi Shenwei,
We are exploring building a database where we can include bacteria and archaea (from GTDB) and viral database from NCBI, while using NCBI style taxonomy but grafting the GTDB taxonomy for the bacteria and archaea level. Do you know how can I achieve this with gtdb-taxdump?
Flextaxd is another tool promising a similar effect. However, we have already downloaded Struo2 r207 genomes and taxdump. I think we might be better off using your tool to combine with NCBI taxonomy instead of flextaxd? Or did I not grasp the idea completely.