Closed shenwei356 closed 2 years ago
GTDB taxonomy taxdump files with trackable TaxIds https://github.com/shenwei356/gtdb-taxdump
MGV is also supported.
$ cat mgv_contig_info.tsv \
| csvtk cut -t -f contig_id,votu_id,ictv_order,ictv_family,ictv_genus \
| sed 1d \
> mgv.tsv
$ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 1 -S 2 -O 3 -F 4 -G 5
16:45:40.555 [WARN] --field-accession-re failed to extract genome accession, the origninal value is used instead. e.g., MGV-GENOME-0231225
16:45:40.817 [INFO] 189680 records saved to mgv/taxid.map
16:45:40.846 [INFO] 54224 records saved to mgv/nodes.dmp
16:45:40.864 [INFO] 54224 records saved to mgv/names.dmp
16:45:40.864 [INFO] 0 records saved to mgv/merged.dmp
16:45:40.864 [INFO] 0 records saved to mgv/delnodes.dmp
$ head -n 5 mgv/taxid.map
MGV-GENOME-0364295 677052301
MGV-GENOME-0364296 677052301
MGV-GENOME-0364303 1414406025
MGV-GENOME-0364311 1849074420
MGV-GENOME-0364312 2074846424
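The first two genomes above share one TaxId, which is what you would expect if they belong to the same vOTU. A minimal sketch for counting genomes per TaxId follows, assuming taxid.map is tab-delimited with the accession in column 1 and the TaxId in column 2. The file name taxid.sample.map is hypothetical; its rows are copied from the output above.

```shell
# Sample rows copied from the `head -n 5 mgv/taxid.map` output (assumed tab-separated).
printf 'MGV-GENOME-0364295\t677052301\n'  >  taxid.sample.map
printf 'MGV-GENOME-0364296\t677052301\n'  >> taxid.sample.map
printf 'MGV-GENOME-0364303\t1414406025\n' >> taxid.sample.map
printf 'MGV-GENOME-0364311\t1849074420\n' >> taxid.sample.map
printf 'MGV-GENOME-0364312\t2074846424\n' >> taxid.sample.map

# Count genomes per TaxId and list the most populated TaxIds first.
# Output columns: TaxId, number of genomes.
counts=$(cut -f 2 taxid.sample.map | sort | uniq -c | sort -k1,1nr | awk '{print $2 "\t" $1}')
printf '%s\n' "$counts"
```

Pointing the same pipeline at mgv/taxid.map instead of the sample file should give the genome count for every species-level TaxId.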
$ echo 677052301 | taxonkit lineage --data-dir mgv/
677052301 Caudovirales;crAss-phage;OTU-61123
$ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P
677052301 k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123
$ csvtk grep -Ht -f 1 -p MGV-GENOME-0364295 mgv.tsv
MGV-GENOME-0364295 OTU-61123 Caudovirales crAss-phage NULL
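The lookup above shows NULL in the genus column for this genome, which appears to be why the genus is missing from the lineage and `g__` is empty in the reformatted output. A rough sketch for counting how many records lack each ICTV rank follows, assuming the same column layout as mgv.tsv (order, family, genus in columns 3 to 5). The file name mgv.sample.tsv is hypothetical and contains only the single row shown above.

```shell
# One row copied from the csvtk grep output above (assumed tab-separated).
printf 'MGV-GENOME-0364295\tOTU-61123\tCaudovirales\tcrAss-phage\tNULL\n' > mgv.sample.tsv

# Columns 3-5 hold order, family and genus (matching -O 3 -F 4 -G 5 above).
# Count the records whose value at each rank is the literal string NULL.
null_counts=$(awk -F '\t' '
    $3 == "NULL" { o++ }
    $4 == "NULL" { f++ }
    $5 == "NULL" { g++ }
    END { printf "order\t%d\nfamily\t%d\ngenus\t%d\n", o, f, g }
' mgv.sample.tsv)
printf '%s\n' "$null_counts"
```

Running the awk command on the full mgv.tsv would show how many genomes have no assignment at each rank.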
Documentation has been added: https://bioinf.shenwei.me/kmcp/database/#mgv-54118-species
Genomes in MGV and GPD are metagenome-assembled genomes (MAGs), assembled from shotgun metagenomic data. Although they are clustered into species, no official TaxIds are available to show their relationships.
A new command similar to gtdb_to_taxdump is needed. Let TaxonKit do it!