hgtector database -o db_dir/ --threads 16
Database building started at 2024-10-17 17:41:03.528510.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database... done.
Total number of TaxIDs: 2614239.
Using local file assembly_summary_refseq.txt.
Reading RefSeq assembly summary... done.
Total number of genomes: 400927.
Genome categories: archaea, bacteria, fungi, protozoa
Traceback (most recent call last):
  File "/home/stm3/miniforge3/envs/hgtector/bin/hgtector", line 96, in <module>
    main()
  File "/home/stm3/miniforge3/envs/hgtector/bin/hgtector", line 35, in main
    module(args)
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 131, in __call__
    self.retrieve_categories()
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 368, in retrieve_categories
    asmset = set(get_categories('RefSeq'))
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 330, in get_categories
    raise ValueError(
ValueError: "archaea" is not a valid RefSeq genome category.
I also tried another command:
hgtector database -o db_dir/ --cats all --threads 10
Database building started at 2024-10-17 16:28:31.818942.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database... done.
Total number of TaxIDs: 2614327.
Using local file assembly_summary_refseq.txt.
Reading RefSeq assembly summary... done.
Total number of genomes: 397638.
Filtering genomes...
Done.
Filtering genomes by taxonomy...
Dropped 9052 genomes without capitalized organism name.
Dropped 5171 genomes with one or more blocked words in organism name.
Dropped 3 genomes without valid taxId.
Done.
Total number of sampled genomes: 383412.
Downloading non-redundant genomic data from NCBI...
WARNING: Cannot retrieve GCF_000001215.4_Release_6_plus_ISO1_MT_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001405.40_GRCh38.p14_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001635.27_GRCm39_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001735.4_TAIR10.1_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002035.6_GRCz11_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002075.1_AplCal3.0_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002235.5_Spur_5.0_protein.faa.gz.
Hello @Rounak-Kumawat, thank you for reporting this. I noticed that the structure of the NCBI FTP site is changing, which breaks the old script. I am working on updating the "database.py" script. Will keep you updated!
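While the fix is pending, one way to check whether a file named in the warnings above exists on the server is to rebuild its URL by hand. The sketch below is not HGTector's code. It assumes the usual NCBI layout, in which the `ftp_path` column of `assembly_summary_refseq.txt` points to a directory named after the assembly and the protein file inside it shares that name as a prefix. The function name `protein_faa_url` is hypothetical.

```python
from urllib.parse import urlsplit


def protein_faa_url(ftp_path: str) -> str:
    """Build the expected protein FASTA URL for one assembly directory.

    Assumption: the directory is named <accession>_<assembly name> and the
    protein file is <that same name>_protein.faa.gz inside it.
    """
    base = ftp_path.rstrip("/")
    # Summary files may list ftp:// or https:// paths; normalise to https.
    if base.startswith("ftp://"):
        base = "https://" + base[len("ftp://"):]
    # The last path component is the assembly directory name.
    stem = urlsplit(base).path.rsplit("/", 1)[-1]
    return f"{base}/{stem}_protein.faa.gz"


example = protein_faa_url(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/"
    "GCF_000001405.40_GRCh38.p14"
)
print(example)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, shows whether that assembly actually ships a protein file at the expected location.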
Can you resolve these issues?