shenwei356 / kmcp

Accurate metagenomic profiling && Fast large-scale sequence/genome searching
https://bioinf.shenwei.me/kmcp
MIT License

TODO: support viral genome collections, e.g., MGV, GPD #11

Closed by shenwei356 1 year ago

shenwei356 commented 2 years ago

Genomes in MGV and GPD were assembled from shotgun metagenomic data (MAGs). Although they are clustered into species, no official TaxIds are available to describe their taxonomic relationships.

A new command similar to gtdb_to_taxdump is needed. Let TaxonKit do it!

shenwei356 commented 2 years ago

GTDB taxonomy taxdump files with trackable TaxIds: https://github.com/shenwei356/gtdb-taxdump
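
One plausible way to keep TaxIds trackable across releases is to derive each ID from the taxon's rank and name instead of assigning sequential numbers. The sketch below only illustrates that idea: `name_to_id` is a hypothetical helper, and `cksum` (CRC-32) is a stand-in hash, so this is not necessarily the scheme that gtdb-taxdump or TaxonKit uses.

```shell
# Hypothetical helper, for illustration only: derive a stable numeric ID
# from a rank and a taxon name. cksum (CRC-32) stands in for whatever
# hash function a real tool would use.
name_to_id() {
    printf '%s:%s' "$1" "$2" | cksum | cut -d ' ' -f 1
}

name_to_id species OTU-61123    # the same input always yields the same ID
name_to_id family crAss-phage   # a different rank/name almost certainly yields a different ID
```

Because the ID depends only on the rank and name, a taxon that keeps its name between releases keeps its ID, which is what makes changes easy to follow.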

shenwei356 commented 2 years ago

MGV is also supported, using taxonkit create-taxdump:

$ cat mgv_contig_info.tsv \
    | csvtk cut -t -f contig_id,votu_id,ictv_order,ictv_family,ictv_genus \
    | sed 1d \
    > mgv.tsv

$ taxonkit create-taxdump mgv.tsv --out-dir mgv --force -A 1 -S 2 -O 3 -F 4 -G 5
16:45:40.555 [WARN] --field-accession-re failed to extract genome accession, the origninal value is used instead. e.g., MGV-GENOME-0231225
16:45:40.817 [INFO] 189680 records saved to mgv/taxid.map
16:45:40.846 [INFO] 54224 records saved to mgv/nodes.dmp
16:45:40.864 [INFO] 54224 records saved to mgv/names.dmp
16:45:40.864 [INFO] 0 records saved to mgv/merged.dmp
16:45:40.864 [INFO] 0 records saved to mgv/delnodes.dmp

$ head -n 5 mgv/taxid.map 
MGV-GENOME-0364295      677052301
MGV-GENOME-0364296      677052301
MGV-GENOME-0364303      1414406025
MGV-GENOME-0364311      1849074420
MGV-GENOME-0364312      2074846424

$ echo 677052301 | taxonkit lineage --data-dir mgv/ 
677052301       Caudovirales;crAss-phage;OTU-61123

$ echo 677052301 | taxonkit reformat --data-dir mgv/ -I 1 -P
677052301       k__;p__;c__;o__Caudovirales;f__crAss-phage;g__;s__OTU-61123

$ csvtk grep -Ht -f 1 -p MGV-GENOME-0364295 mgv.tsv 
MGV-GENOME-0364295      OTU-61123       Caudovirales    crAss-phage     NULL
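
In the taxid.map output above, the first two genomes share TaxId 677052301, i.e., genomes in the same vOTU map to one species-level TaxId. As a quick sanity check, the number of genomes per TaxId can be counted with standard tools. This is a minimal sketch: `count_genomes_per_taxid` and `taxid.map.sample` are hypothetical names, and the sample rows are the five lines shown by `head -n 5 mgv/taxid.map`.

```shell
# Sample rows copied from the `head -n 5 mgv/taxid.map` output above.
printf '%s\t%s\n' \
    MGV-GENOME-0364295 677052301 \
    MGV-GENOME-0364296 677052301 \
    MGV-GENOME-0364303 1414406025 \
    MGV-GENOME-0364311 1849074420 \
    MGV-GENOME-0364312 2074846424 \
    > taxid.map.sample

# Hypothetical helper (not part of TaxonKit): print "TaxId<TAB>genome count"
# for a two-column, tab-separated taxid.map file, largest groups first.
count_genomes_per_taxid() {
    awk -F '\t' '{ n[$2]++ } END { for (t in n) print t "\t" n[t] }' "$1" \
        | sort -t "$(printf '\t')" -k2,2nr -k1,1n
}

count_genomes_per_taxid taxid.map.sample
# 677052301     2
# 1414406025    1
# 1849074420    1
# 2074846424    1
```

On the full file, pass mgv/taxid.map instead of the sample file.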

shenwei356 commented 2 years ago

Documentation has been added: https://bioinf.shenwei.me/kmcp/database/#mgv-54118-species