shenwei356 / gtdb-taxdump

GTDB taxonomy taxdump files with trackable TaxIds
MIT License
49 stars 2 forks source link

[Question] GTDB plus NCBI kraken2 database building using GTDB grafted NCBI taxonomy #6

Open zztin opened 1 year ago

zztin commented 1 year ago

Hi Shenwei,

We are exploring building a database where we can include bacteria and archaea (from GTDB) and viral database from NCBI, while using NCBI style taxonomy but grafting the GTDB taxonomy for the bacteria and archaea level. Do you know how can I achieve this with gtdb-taxdump?

Flextaxd is another tool promising a similar effect. However, we have already downloaded Struo2 r207 genomes and taxdump. I think we might be better off using your tool to combine with NCBI taxonomy instead of flextaxd? Or did I not grasp the idea completely.

shenwei356 commented 1 year ago

Someone had the same need before, and I write these steps and will add them to the doc of TaxonKit.

Merging GTDB and NCBI taxonomy

Sometimes (1) one needs to build a database including bacteria and archaea (from GTDB) and viral database from NCBI. The idea is to export lineages from both GTDB and NCBI using taxonkit reformat, and then create taxdump files from them with taxonkit create-taxdump.

  1. Exporting taxonomic lineages of taxa with rank equal to species from GTDB-taxdump.

    taxonkit list --data-dir gtdb-taxdump/R207/ --ids 1 --indent "" \
        | taxonkit filter --data-dir gtdb-taxdump/R207/ --equal-to species \
        | taxonkit reformat --data-dir gtdb-taxdump/R207/ --taxid-field 1 \
            --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}" \
            -o gtdb.tsv
  2. Exporting taxonomic lineages of viral taxa with rank equal to or lower than species from NCBI taxdump. Since the lineage of many viruses are incomplete, we fill in missing taxa, e.g., "327830 Viruses;Polydnaviriformidae;Ichnoviriform;unclassified Ichnovirus" -> "Viruses;unclassified Viruses phylum;unclassified Viruses class;unclassified Viruses order;Polydnaviriformidae;Ichnoviriform;unclassified Ichnoviriform species;unclassified Ichnovirus".

    # taxid of Viruses: 10239
    taxonkit list --data-dir ~/.taxonkit --ids 10239 --indent "" \
        | taxonkit filter --data-dir ~/.taxonkit --equal-to species --lower-than species \
        | taxonkit reformat --data-dir ~/.taxonkit --taxid-field 1 \
            --fill-miss-rank --pseudo-strain --format "{k}\t{p}\t{c}\t{o}\t{f}\t{g}\t{s}\t{t}" \
            -o ncbi-viral.tsv
  3. Creating taxdump from lineages above.

    (awk '{print $_"\t"}' gtdb.tsv; cat ncbi-viral.tsv) \
        | taxonkit create-taxdump \
            --field-accession 1 \
            -R "superkingdom,phylum,class,order,family,genus,species,strain" \
            -O taxdump

    we use --field-accession 1 to output the mapping file between old taxids and new ones.

    $ grep 2697049 taxdump/taxid.map # SARS-COV-2 2697049 21630522

Some tests

taxdump.tar.gz

#  # SARS-COV-2 in NCBI taxonomy
$ echo 2697049 \
    | taxonkit lineage -t --data-dir ~/.taxonkit \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir ~/.taxonkit \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
10239     superkingdom   Viruses
2559587   clade          Riboviria
2732396   kingdom        Orthornavirae
2732408   phylum         Pisuviricota
2732506   class          Pisoniviricetes
76804     order          Nidovirales
2499399   suborder       Cornidovirineae
11118     family         Coronaviridae
2501931   subfamily      Orthocoronavirinae
694002    genus          Betacoronavirus
2509511   subgenus       Sarbecovirus
694009    species        Severe acute respiratory syndrome-related coronavirus
2697049   no rank        Severe acute respiratory syndrome coronavirus 2

$ echo "Severe acute respiratory syndrome coronavirus 2" | taxonkit name2taxid --data-dir taxdump/
Severe acute respiratory syndrome coronavirus 2 216305222

$ echo 216305222 \
    | taxonkit lineage -t --data-dir taxdump/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
1287770734   superkingdom   Viruses
1506901452   phylum         Pisuviricota
1091693597   class          Pisoniviricetes
37745009     order          Nidovirales
738421640    family         Coronaviridae
906833049    genus          Betacoronavirus
1015862491   species        Severe acute respiratory syndrome-related coronavirus
216305222    strain         Severe acute respiratory syndrome coronavirus 2

$ echo "Escherichia coli"  | taxonkit name2taxid --data-dir taxdump/
Escherichia coli        1945799576

$ echo 1945799576 \
    | taxonkit lineage -t --data-dir taxdump/ \
    | csvtk cut -Ht -f 3 \
    | csvtk unfold -Ht -f 1 -s ";" \
    | taxonkit lineage -r -n -L --data-dir taxdump/ \
    | csvtk cut -Ht -f 1,3,2 \
    | csvtk pretty -Ht
609216830    superkingdom   Bacteria
1641076285   phylum         Proteobacteria
329474883    class          Gammaproteobacteria
1012954932   order          Enterobacterales
87250111     family         Enterobacteriaceae
1187493883   genus          Escherichia
1945799576   species        Escherichia coli
zztin commented 1 year ago

Hi Shenwei,

Great documentation! Thank you for your time!

I have 2 related questions:

  1. I did not completely describe my question. We also like to integrate human, plasmid, and UniVec_Core sequences. Since Bacteria and Achaea and Virus example above was cut at Domain(D) level, how would you suggest to integrate the upper levels as listed below? From kraken2-inspect file by grep the text, I got the following taxids:

    taxid of (U) unclassified: 0 
    taxid of (R) root : 1
    taxid of (D) Eukaryota: 2759
    taxid of (R1) other entries: 2787854
    taxid of (R2) other sequences: 28384
    taxid of (R3) plasmids:  36549
    (other entries -> other sequences -> plasmids)
    All UniVec_Core sequences are assigned as other sequences. taxid = 28384 
  2. For clarity, following step3 in the tutorial if I continue to build the kraken2 database. Is the following steps sufficient? a. unzip the taxdump file and locate these files in the $DBNAME/taxonomy folder b. move all library.fna files (including GTDB downloaded from Struo2, and libraries downloaded by kraken2-build --download-library, and customized genome if existed in the $DBNAME/library folder c. build the kraken2 db by kraken2-build build --db $DBNAME

Especially for the a. part. Is the taxdump file containing all files needed for taxonomy folder? For a standard download with kraken2-build --download-taxonomy, the file list I got are:

taxdump.untarflag
taxdump.dlflag
taxdump.tar.gz
accmap.dlflag
nucl_wgs.accession2taxid
nucl_gb.accession2taxid
gc.prt
citations.dmp
names.dmp
nodes.dmp
delnodes.dmp
merged.dmp
division.dmp
gencode.dmp
readme.txt

Thank you very much. Best, Li-Ting

shenwei356 commented 1 year ago

I did not completely describe my question. We also like to integrate human, plasmid, and UniVec_Core sequences. Since Bacteria and Achaea and Virus example above was cut at Domain(D) level, how would you suggest to integrate the upper levels as listed below?

Whatever you've got, you can follow the steps above to 1) export/create complete lineages and 2) create taxdump files from them.

For clarity, following step3 in the tutorial if I continue to build the kraken2 database. Is the following steps sufficient? a. unzip the taxdump file and locate these files in the $DBNAME/taxonomy folder b. move all library.fna files (including GTDB downloaded from Struo2, and libraries downloaded by kraken2-build --download-library, and customized genome if existed in the $DBNAME/library folder c. build the kraken2 db by kraken2-build build --db $DBNAME

Here are some steps for Custom database for Kraken and Bracken you can learn from. Just prepare the files in the required format that Kraken2 wants.

Especially for the a. part. Is the taxdump file containing all files needed for taxonomy folder? For a standard download with kraken2-build --download-taxonomy, the file list I got are:

Essential taxdump files are:

names.dmp
nodes.dmp
delnodes.dmp
merged.dmp

The accession2taxid files map accession (only these in RefSeq/Genbank) to taxid, which is optional if you format the FASTA IDs for Kraken by yourself. For the sequences from GTDB, the mapping relationship is provided by gtdb-taxdump/R207/taxid.map.

$ head -n 1  gtdb-taxdump/R207/taxid.map
GCF_000979375.1 1349515035

For viral sequences, it requires an extra step. The relationship is accession -> NCBI taxid -> New TaxId created by taxonkit.

accession -> NCBI taxid           # provided by  nucl_gb.accession2taxid
NCBI taxid -> New  TaxId created by  taxonkit       # available in taxid.map generated with taxonkit with the steps above, but the order is reversed:  'New  TaxId created by  taxonkit' -> 'NCBI taxid'.

What I can provide is the way to generate taxdump files that combine GTDB taxonomy and NCBI taxonomy, and taxid.map file that maps accession (whatever you provide, you can use the sequence accession) to TaxIds.

Good luck!