qiyunlab / HGTector

HGTector2: Genome-wide prediction of horizontal gene transfer based on distribution of sequence homology patterns.
BSD 3-Clause "New" or "Revised" License

about database build! #123

Closed jiyanhan closed 1 year ago

jiyanhan commented 1 year ago

Hi, I used the following command to build the database:

hgtector database -o /disks/node2_RAID6_60TB/database/hgtector --default

But while downloading the data, the script crashed with the following error:

Database building started at 2023-07-05 21:15:36.623397.
The default protocol is selected for database building.
The program will download all protein sequences of NCBI RefSeq genomes of bacteria, archaea, fungi and protozoa, keep one genome per species, plus all NCBI-defined reference and representative genomes.
Using local file taxdump.tar.gz.
Reading NCBI taxonomy database... done.
Total number of TaxIDs: 2514661.
Using local file assembly_summary_refseq.txt.
Reading RefSeq assembly summary... done.
Total number of genomes: 315511.
Genome categories: archaea, bacteria, fungi, protozoa
Downloading genome list per RefSeq category...
Using local file refseq_archaea.txt.
archaea: 1568
Using local file refseq_bacteria.txt.
bacteria: 297241
Using local file refseq_fungi.txt.
fungi: 544
Using local file refseq_protozoa.txt.
protozoa: 96
Done.
Traceback (most recent call last):
  File "/opt/biosoft/miniconda3/envs/hgtector/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3653, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 147, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 176, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7080, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7088, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: '# assembly_accession'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/biosoft/miniconda3/envs/hgtector/bin/hgtector", line 96, in <module>
    main()
  File "/opt/biosoft/miniconda3/envs/hgtector/bin/hgtector", line 35, in main
    module(args)
  File "/opt/biosoft/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/database.py", line 131, in __call__
    self.retrieve_categories()
  File "/opt/biosoft/miniconda3/envs/hgtector/lib/python3.8/site-packages/hgtector/database.py", line 368, in retrieve_categories
    self.df = self.df[self.df['# assembly_accession'].isin(asmset)]
  File "/opt/biosoft/miniconda3/envs/hgtector/lib/python3.8/site-packages/pandas/core/frame.py", line 3761, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/biosoft/miniconda3/envs/hgtector/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3655, in get_loc
    raise KeyError(key) from err
KeyError: '# assembly_accession'
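
To confirm that a mismatched header is the cause, it can help to print the first column name of the downloaded summary file exactly as it is written. The following is a minimal sketch and not part of HGTector: the helper name `accession_column` is made up for illustration, and it assumes the column header is the first `#`-prefixed line that contains `assembly_accession`. The sample lines are invented stand-ins for the real file.

```python
def accession_column(lines):
    """Return the first column name of the header line, exactly as written.

    Hypothetical helper for illustration; not part of HGTector.
    Returns None if no header line is found.
    """
    for line in lines:
        if line.startswith('#') and 'assembly_accession' in line:
            # Columns are tab-separated; keep the name verbatim so that a
            # space after '#' (or its absence) stays visible.
            return line.rstrip('\n').split('\t')[0]
    return None


# Made-up sample in the shape of the summary file (not real data).
sample = [
    '# comment line\n',
    '#assembly_accession\tbioproject\n',
    'GCF_example\tPRJ_example\n',
]
print(repr(accession_column(sample)))
```

On a real run, an open file handle for `assembly_summary_refseq.txt` can be passed in place of `sample`. If the result differs from the key shown in the `KeyError`, the header and the installed code are out of sync.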

mengli2022 commented 1 year ago

Is your problem solved? I have the same problem.

jeankeller commented 1 year ago

Hi, I've got the same issue. It is caused by a space on line 368 of the database.py script: self.df = self.df[self.df['# assembly_accession'].isin(asmset)] should be self.df = self.df[self.df['#assembly_accession'].isin(asmset)], because the header in the downloaded file no longer contains the space. One workaround is to run the command once and, after it has failed:

  1. chmod u+w download/assembly_summary_refseq.txt
  2. sed -i "s/#assembly_accession/# assembly_accession/g" download/assembly_summary_refseq.txt (or edit the 2nd line with a text editor such as Notepad++); this restores the space that the installed database.py expects
  3. re-run (with the same -o directory)
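
Steps 1 and 2 above can also be sketched in Python, for systems where `sed -i` is unavailable or behaves differently. This is an illustrative equivalent under stated assumptions, not an HGTector feature: `patch_header` is a hypothetical name, and the function reads the whole file into memory, which is assumed to be acceptable for a file of this size.

```python
import os
import stat


def patch_header(path):
    """Rewrite '#assembly_accession' as '# assembly_accession' in place.

    Hypothetical helper mirroring the chmod + sed steps.
    Returns the number of replacements made.
    """
    # Step 1: make the file writable by its owner (chmod u+w).
    mode = os.stat(path).st_mode
    os.chmod(path, mode | stat.S_IWUSR)

    # Step 2: apply the same substitution as the sed command.
    # newline='' keeps the original line endings untouched.
    with open(path, encoding='utf-8', newline='') as fh:
        text = fh.read()
    count = text.count('#assembly_accession')
    with open(path, 'w', encoding='utf-8', newline='') as fh:
        fh.write(text.replace('#assembly_accession', '# assembly_accession'))
    return count
```

Calling `patch_header('download/assembly_summary_refseq.txt')` from inside the `-o` directory corresponds to the two shell commands. Running it a second time changes nothing, because the patched header no longer contains the search string.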

Best, Jean

qiyunzhu commented 1 year ago

Hello @jeankeller @jiyanhan @mengli2022, thank you for reporting this issue and finding a solution. The cause is that NCBI has updated the format of their assembly summary table. I updated the code to reflect this change (#125). You can update HGTector with the following command, which should solve the problem.

pip install --force-reinstall --no-cache-dir git+https://github.com/qiyunlab/HGTector.git
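
For anyone who parses the assembly summary table in their own scripts, a lookup that accepts both header spellings avoids this `KeyError` regardless of which format the file uses. The sketch below is one possible approach and is not claimed to be the change made in #125; `find_column` is a hypothetical helper name.

```python
def find_column(columns, name='assembly_accession'):
    """Return the column label matching `name`, ignoring a leading '#'
    and any surrounding spaces.

    Hypothetical helper; raises KeyError if no column matches.
    """
    for col in columns:
        if col.lstrip('#').strip() == name:
            return col
    raise KeyError(name)


# Both header spellings resolve to whatever label the table actually uses.
print(find_column(['#assembly_accession', 'bioproject']))
print(find_column(['# assembly_accession', 'bioproject']))
```

With a pandas DataFrame `df`, the returned label could then be used as `df[find_column(df.columns)]` in place of a hard-coded column name.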