shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Create-taxdump not generating subspecies #67

Closed Lucas-Maciel closed 1 year ago

Lucas-Maciel commented 1 year ago

Hi,

I'm trying to use taxonkit create-taxdump but I have two questions:

1) I'm using the following command, but all my "accession" names are being assigned the rank "no rank" instead of "subspecies". Am I missing something?

$ taxonkit create-taxdump class.gtdb.tsv --field-accession-as-subspecies --gtdb --out-dir taxdump/
08:18:38.735 [WARN] --gtdb-re-subs failed to extract ID for subspecies, the original value is used instead. e.g., HumGut_30691
08:18:38.971 [INFO] 32264 records saved to taxdump/taxid.map
08:18:39.521 [INFO] 37770 records saved to taxdump/nodes.dmp
08:18:39.861 [INFO] 37770 records saved to taxdump/names.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/merged.dmp
08:18:39.884 [INFO] 0 records saved to taxdump/delnodes.dmp

My input has the following format

MGG00015        d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__COE1;s__COE1 sp002358575
MGG00050        d__Bacteria;p__Firmicutes_A;c__Clostridia_A;o__Christensenellales;f__CAG-552;g__MGG03569;s__MGG03569 MGG00050

2) Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? For example, I want to use this custom GTDB taxdump together with the NCBI viral and fungal databases from Kraken2, but I'm worried about conflicts between TaxId numbers.

Thank you for your time. Kind regards

shenwei356 commented 1 year ago

https://github.com/shenwei356/gtdb-taxdump#taxonomic-hierarchy

A GTDB species cluster contains >=1 assemblies, each of which can be treated as a strain. So we can assign each assembly a TaxId with the rank of "no rank" below the species rank. This also lets us track changes to these assemblies via the TaxId later.

Don't worry about the "no rank" label: the node sits directly below the species rank, so it is effectively at the subspecies level.

609216830    superkingdom   Bacteria
947989846    phylum         Firmicutes_A
1797966051   class          Clostridia
1853814285   order          Lachnospirales
3217231047   family         Lachnospiraceae
1880979389   genus          COE1
2414110737   species        COE1 sp002358575
2538223356   no rank        MGG00015
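
To make the point concrete, here is a minimal Python sketch (not part of TaxonKit; the `lineage` list and the helper name `is_below_species` are hypothetical). It treats a node as subspecies-level whenever it appears after the species node in a root-to-leaf lineage, whatever its rank label says.

```python
# Lineage copied from the table above, ordered from root to leaf.
lineage = [
    (609216830, "superkingdom", "Bacteria"),
    (947989846, "phylum", "Firmicutes_A"),
    (1797966051, "class", "Clostridia"),
    (1853814285, "order", "Lachnospirales"),
    (3217231047, "family", "Lachnospiraceae"),
    (1880979389, "genus", "COE1"),
    (2414110737, "species", "COE1 sp002358575"),
    (2538223356, "no rank", "MGG00015"),
]


def is_below_species(lineage, taxid):
    """Return True if `taxid` appears after the species node in the lineage.

    Hypothetical helper for illustration only; it relies on the lineage
    being ordered from root to leaf.
    """
    ranks = [rank for _, rank, _ in lineage]
    ids = [node_id for node_id, _, _ in lineage]
    if "species" not in ranks or taxid not in ids:
        return False
    return ids.index(taxid) > ranks.index("species")


# The assembly node is labelled "no rank" but sits below the species node.
print(is_below_species(lineage, 2538223356))
```

A downstream tool that needs an explicit "subspecies" label could apply a position-based check like this instead of relying on the rank string.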

shenwei356 commented 1 year ago

> Do you have any tips on how to safely integrate this taxdump file with the one provided by NCBI? I want for example to use this custom GTDB taxdump together with the NCBI viral and fungi database from Kraken2. But I'm worried about the conflicts between taxid numbers.

It's a great idea. I think my taxonomic profiling tool KMCP should also use this combined taxonomy. Previously, we used the NCBI taxonomy for reference genomes from both GTDB and RefSeq.

To achieve this, you need to create the taxdump files from both the GTDB lineages and the NCBI lineages of the viruses and fungi in a single run, so that all TaxIds are generated together:

  1. Get the 7-rank lineages of the viral and fungal taxa with taxonkit list | taxonkit reformat.
  2. Similarly, get the 7-rank lineages of GTDB, either by directly reformatting the GTDB taxonomy format or by running taxonkit create-taxdump --gtdb followed by taxonkit list | taxonkit reformat.
  3. Concatenate all the lineages and run taxonkit create-taxdump on the result.
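
For the "directly reformatting" route in step 2 and the concatenation in step 3, a minimal Python sketch could look like the following. It rests on assumptions: the helper names `gtdb_to_ranks` and `merge_lineages` are hypothetical, the GTDB input is the two-column (accession, lineage) layout shown earlier in this thread, and the NCBI rows are taken to be 7-rank lineages already produced by taxonkit list | taxonkit reformat. The NCBI row in the example is a placeholder, not real data, and the create-taxdump options needed to read the merged table should be checked against the TaxonKit documentation.

```python
# GTDB rank prefixes in order: domain, phylum, class, order, family, genus, species.
GTDB_PREFIXES = ["d__", "p__", "c__", "o__", "f__", "g__", "s__"]


def gtdb_to_ranks(lineage):
    """Split 'd__X;p__Y;...;s__Z' into seven names with the prefixes removed."""
    parts = lineage.strip().split(";")
    if len(parts) != len(GTDB_PREFIXES):
        raise ValueError(f"expected 7 ranks, got {len(parts)}: {lineage!r}")
    names = []
    for prefix, part in zip(GTDB_PREFIXES, parts):
        if not part.startswith(prefix):
            raise ValueError(f"expected prefix {prefix!r} in {part!r}")
        names.append(part[len(prefix):])
    return names


def merge_lineages(gtdb_rows, ncbi_rows):
    """Return tab-separated lines (accession + 7 ranks) for both sources."""
    lines = []
    for accession, lineage in gtdb_rows:
        lines.append("\t".join([accession] + gtdb_to_ranks(lineage)))
    for accession, ranks in ncbi_rows:
        lines.append("\t".join([accession] + list(ranks)))
    return lines


gtdb_rows = [
    ("MGG00015",
     "d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;"
     "f__Lachnospiraceae;g__COE1;s__COE1 sp002358575"),
]
# Placeholder NCBI row: every value here is made up for illustration.
ncbi_rows = [("ACC_X", ["K", "P", "C", "O", "F", "G", "S"])]

merged = merge_lineages(gtdb_rows, ncbi_rows)
for line in merged:
    print(line)
```

Writing both sources into one table with identical columns is what allows a single create-taxdump run to assign TaxIds for all lineages together.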

I'll add these steps to the tutorial, maybe next week (we're on holiday).

Lucas-Maciel commented 1 year ago

@shenwei356 thank you for your reply.

I'll try your instructions and check out KMCP as well.

Best, Lucas

shenwei356 commented 1 year ago

I've added a tutorial on merging the GTDB and NCBI taxonomies, which could help.