qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License
131 stars 35 forks

Database Download Issue #138

Open Rounak-Kumawat opened 1 month ago

Rounak-Kumawat commented 1 month ago

hgtector database -o db_dir/ --threads 16

Database building started at 2024-10-17 17:41:03.528510.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database... done.
  Total number of TaxIDs: 2614239.
Using local file assembly_summary_refseq.txt.
Reading RefSeq assembly summary... done.
  Total number of genomes: 400927.
Genome categories: archaea, bacteria, fungi, protozoa
Traceback (most recent call last):
  File "/home/stm3/miniforge3/envs/hgtector/bin/hgtector", line 96, in <module>
    main()
  File "/home/stm3/miniforge3/envs/hgtector/bin/hgtector", line 35, in main
    module(args)
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 131, in __call__
    self.retrieve_categories()
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 368, in retrieve_categories
    asmset = set(get_categories('RefSeq'))
                 ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/stm3/miniforge3/envs/hgtector/lib/python3.12/site-packages/hgtector/database.py", line 330, in get_categories
    raise ValueError(
ValueError: "archaea" is not a valid RefSeq genome category.

I also tried another command:

hgtector database -o db_dir/ --cats all --threads 10

Database building started at 2024-10-17 16:28:31.818942.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database... done.
  Total number of TaxIDs: 2614327.
Using local file assembly_summary_refseq.txt.
Reading RefSeq assembly summary... done.
  Total number of genomes: 397638.
Filtering genomes... Done.
Filtering genomes by taxonomy...
  Dropped 9052 genomes without capitalized organism name.
  Dropped 5171 genomes with one or more blocked words in organism name.
  Dropped 3 genomes without valid taxId.
Done.
Total number of sampled genomes: 383412.
Downloading non-redundant genomic data from NCBI...
WARNING: Cannot retrieve GCF_000001215.4_Release_6_plus_ISO1_MT_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001405.40_GRCh38.p14_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001635.27_GRCm39_protein.faa.gz.
WARNING: Cannot retrieve GCF_000001735.4_TAIR10.1_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002035.6_GRCz11_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002075.1_AplCal3.0_protein.faa.gz.
WARNING: Cannot retrieve GCF_000002235.5_Spur_5.0_protein.faa.gz.

Can you resolve these issues?

qiyunzhu commented 1 month ago

Hello @Rounak-Kumawat, thank you for reporting this. I noticed that the structure of the NCBI FTP site is evolving, which makes the old script struggle. I am working on updating the "database.py" script and will keep you updated!
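
While the script is being updated, the following may help narrow down which downloads fail. It is a minimal sketch, not part of HGTector and not how "database.py" works: the helper names (`failed_files`, `accession`, `protein_url`) are hypothetical. It assumes the log format shown in the report above, and that a protein file sits under the assembly's `ftp_path` from assembly_summary_refseq.txt with the name `<directory name>_protein.faa.gz`.

```python
import re

# Matches the warning lines printed in the log above, e.g.
# "WARNING: Cannot retrieve GCF_000001405.40_GRCh38.p14_protein.faa.gz."
WARNING_RE = re.compile(r"WARNING: Cannot retrieve (\S+?_protein\.faa\.gz)\.")


def failed_files(log_text):
    """Return the file names reported as not retrievable (hypothetical helper)."""
    return WARNING_RE.findall(log_text)


def accession(filename):
    """Extract the assembly accession, i.e. the first two '_'-separated fields."""
    return "_".join(filename.split("_")[:2])


def protein_url(ftp_path):
    """Build the expected protein file URL from an assembly's ftp_path.

    Assumption: the file is named after the last path component of ftp_path.
    """
    base = ftp_path.rstrip("/")
    return f"{base}/{base.rsplit('/', 1)[-1]}_protein.faa.gz"


if __name__ == "__main__":
    log = (
        "WARNING: Cannot retrieve GCF_000001405.40_GRCh38.p14_protein.faa.gz. "
        "WARNING: Cannot retrieve GCF_000001635.27_GRCm39_protein.faa.gz."
    )
    for name in failed_files(log):
        print(accession(name), name)
```

Looking up each printed accession in assembly_summary_refseq.txt and opening the URL built by `protein_url` in a browser would show whether the file is actually missing from the server or only the path the script constructs is out of date.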