shenwei356 / taxonkit

A Practical and Efficient NCBI Taxonomy Toolkit, also supports creating NCBI-style taxdump files for custom taxonomies like GTDB/ICTV
https://bioinf.shenwei.me/taxonkit
MIT License

Error with makeblastdb using GTDB taxonkit create-taxdump taxid.map #70

Closed BenjaminJPerry closed 1 year ago

BenjaminJPerry commented 1 year ago

Hello Wei Shen,

This is not strictly an error with taxonkit create-taxdump, but more of a feature request?

I'm trying to use the taxid.map generated using taxonkit create-taxdump for the GTDB database (r207) when making a blastn database of the complete set of GTDB representative genomes (r207).

Making the taxdump using taxonkit,

(/home/perrybe/conda-envs/taxonkit) inscrutable$ taxonkit --help | head
TaxonKit - A Practical and Efficient NCBI Taxonomy Toolkit

Version: 0.13.0

Author: Wei Shen <shenwei356@gmail.com>

Source code: https://github.com/shenwei356/taxonkit
Documents  : https://bioinf.shenwei.me/taxonkit
Citation   : https://www.sciencedirect.com/science/article/pii/S1673852721000837

(/home/perrybe/conda-envs/taxonkit) inscrutable$ taxonkit create-taxdump --gtdb --force ar53_taxonomy.tsv bac120_taxonomy.tsv -O ./
09:02:31.932 [INFO] 317542 records saved to taxid.map
09:02:32.366 [INFO] 401815 records saved to nodes.dmp
09:02:32.642 [INFO] 401815 records saved to names.dmp
09:02:32.644 [INFO] 0 records saved to merged.dmp
09:02:32.644 [INFO] 0 records saved to delnodes.dmp

Using it to make the blast database (where the error occurs),

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ makeblastdb -version
makeblastdb: 2.9.0+
 Package: blast 2.9.0, build May 31 2019 20:53:30

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ makeblastdb -in GTDB-latest.fna -dbtype nucl -parse_seqids -taxid_map taxid.map

Building a new DB, current time: 11/24/2022 08:56:25
New DB name:   /bifo/scratch/2022-BJP-GTDB_Benchmarking/gtdb-latest/GTDB-latest.fna
New DB title:  GTDB-latest.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 1000000000B
Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1559335677723/work/c++/src/corelib/ncbistr.cpp", line 578: Error: ncbi::NStr::StringToInt() - Cannot convert string '2988443261' to int, overflow (m_Pos = 0)

In the taxid.map generated with taxonkit,

(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ cat taxid.map | wc -l
317542
(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ cat taxid.map | grep -n "2988443261"
4:GCF_000980105.1       2988443261
(/dataset/bioinformatics_dev/active/conda-env/blast2.9) inscrutable$ head taxid.map
GCF_000979375.1 1349515035
GCF_000970165.1 1457399847
GCF_000979555.1 732503645
GCF_000980105.1 2988443261 <---
GCF_000007065.1 369781300
GCF_000980175.1 148096987
GCF_000970205.1 3005035806
GCA_002506415.1 1847834409
GCF_000970245.1 977990156
GCF_000970185.1 3122581739

It seems like the size of the value is too large for makeblastdb to handle when building?

It may be more of an issue with makeblastdb, but I thought I would pass it on as it might be an easy fix in taxonkit 😋
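To see how many entries are affected, a short scan of taxid.map can help. This is a minimal sketch, assuming the file has two whitespace-separated columns (accession, taxid) as in the `head` output above, and that the limit is the signed 32-bit maximum implied by the `StringToInt()` overflow message. The function name `find_overflowing_taxids` is made up for illustration.

```python
INT32_MAX = (1 << 31) - 1  # 2147483647, assumed limit for a signed 32-bit taxid


def find_overflowing_taxids(lines):
    """Return (line_number, accession, taxid) for every taxid above INT32_MAX."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        accession, taxid = fields[0], int(fields[1])
        if taxid > INT32_MAX:
            hits.append((lineno, accession, taxid))
    return hits


# The ten lines shown by `head taxid.map` above (without the "<---" marker).
sample = [
    "GCF_000979375.1 1349515035",
    "GCF_000970165.1 1457399847",
    "GCF_000979555.1 732503645",
    "GCF_000980105.1 2988443261",
    "GCF_000007065.1 369781300",
    "GCF_000980175.1 148096987",
    "GCF_000970205.1 3005035806",
    "GCA_002506415.1 1847834409",
    "GCF_000970245.1 977990156",
    "GCF_000970185.1 3122581739",
]

hits = find_overflowing_taxids(sample)
for hit in hits:
    print(hit)
```

For the full file, the same function can be fed an open file handle, e.g. `with open("taxid.map") as fh: find_overflowing_taxids(fh)`.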

Thank you for all the excellent bioinformatic software 🥇 😁

Ben

BenjaminJPerry commented 1 year ago

I tried using the latest release of makeblastdb and had the same error,

inscrutable$ makeblastdb -version
makeblastdb: 2.13.0+
 Package: blast 2.13.0, build Jul 18 2022 22:49:37

inscrutable$ makeblastdb -in GTDB-latest.fna -input_type fasta -dbtype nucl -taxid_map taxid.map -parse_seqids -out GTDB-r207

Building a new DB, current time: 11/24/2022 09:38:13
New DB name:   /bifo/scratch/2022-BJP-GTDB_Benchmarking/gtdb-latest/GTDB-r207
New DB title:  GTDB-latest.fna
Sequence type: Nucleotide
Keep MBits: T
Maximum file size: 3000000000B
Error: NCBI C++ Exception:
    T0 "/opt/conda/conda-bld/blast_1658184301332/work/blast/c++/src/corelib/ncbistr.cpp", line 640: Error: (CStringException::eConvert) ncbi::NStr::StringToInt() - Cannot convert string '2988443261' to int, overflow (m_Pos = 0)

shenwei356 commented 1 year ago

We hashed the taxon name (in lower case) of each taxon node to uint64 using xxhash and converted it to uint32 (max value: (1<<32) - 1 = 4294967295). However, it looks like more than one tool (https://github.com/shenwei356/gtdb-taxdump/issues/4) stores a taxid as an int32 (max value: (1<<31) - 1 = 2147483647).
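The range mismatch can be illustrated with plain integer arithmetic. This is only a sketch: the 64-bit value below is made up (it is not a real xxhash output), and keeping the low 32 bits is an assumption about how the uint64 might be converted to uint32, since the comment does not say.

```python
INT32_MAX = (1 << 31) - 1   # 2147483647
UINT32_MAX = (1 << 32) - 1  # 4294967295


def to_uint32(h64: int) -> int:
    """Keep the low 32 bits of a 64-bit hash (assumed conversion, for illustration)."""
    return h64 & UINT32_MAX


def fits_int32(taxid: int) -> bool:
    """True if the taxid can be stored in a signed 32-bit integer."""
    return 0 <= taxid <= INT32_MAX


# Hypothetical 64-bit hash whose low 32 bits equal the taxid from the error message.
h64 = (0x1234ABCD << 32) | 2988443261
taxid = to_uint32(h64)

print(taxid, fits_int32(taxid))  # 2988443261 False
```

Any uint32 value with its top bit set lands above 2147483647, so roughly half of all such taxids would be rejected by a tool that parses them as int32.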

It's time for a change.

shenwei356 commented 1 year ago

Just updated the code. Please test it.

$ grep GCF_000980105.1 gtdb-taxdump/R207/taxid.map 
GCF_000980105.1 840959613

I'll update https://github.com/shenwei356/gtdb-taxdump later.
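One observation about the two values for GCF_000980105.1: the new taxid equals the old one with bit 31 cleared. The thread does not say how the updated code derives taxids, so the snippet below only checks the arithmetic for this one pair and should not be read as a description of the actual fix.

```python
INT32_MAX = (1 << 31) - 1  # 2147483647, also the mask for the low 31 bits

OLD_TAXID = 2988443261  # value in the original taxid.map
NEW_TAXID = 840959613   # value after the update

# Clearing the top bit of the old uint32 value (equivalent to subtracting 2**31 here).
masked = OLD_TAXID & INT32_MAX

print(masked, masked == NEW_TAXID)  # 840959613 True
print(NEW_TAXID <= INT32_MAX)       # True
```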

shenwei356 commented 1 year ago

Tagged a new release: v0.14.0