phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0
25 stars 9 forks source link

Updated location of SISTR database URL #51

Closed apetkau closed 2 years ago

apetkau commented 3 years ago

The previous URL for the SISTR database (https://irida.corefacility.ca/downloads/sistr/database/SISTR_V_1.1_db.tar.gz) will soon no longer be available. We have moved the database to a new URL https://sairidapublic.blob.core.windows.net/downloads/sistr/database/SISTR_V_1.1_db.tar.gz.

This change won't need a new release of SISTR since a copy of the database files is stored in the PyPI package and installed when using pip install sistr_cmd (or conda install). However, this URL will need to be used for anyone who installs via the GitHub repo to do development on SISTR.

kbessonov1984 commented 3 years ago

The Conda recipe also pulls PyPI archive so it is also not affected by url change. How long would this new hosting will last? Will test tomorrow and see how it works and then merge

apetkau commented 3 years ago

Thanks for that. Yes, conda installation is unaffected as well.

The new hosting should be permanent.

Thanks for taking the time to test this out.

kbessonov1984 commented 3 years ago

@apetkau @Takadonet I had tested the package inside docker container and it works with a caveat. Install from source is a bit problematic because it assumes that database files are initialized. If it is not the case, the package compiles, but you get file paths errors that do not allow the setup_sistr_dbs() to be executed to mitigate this issue due to imports of those resources at the very beginning of the sistr_cmd module.

The way on mitigate those issues is to download database via the python3 -c "from sistr.sistr_cmd import *; setup_sistr_dbs()" command followed by python3 setup.py install. One can mitigate this issue by updating setup.py that would automatically download databases before package compile time as databases are imported as package resources and not as some external dependencies.

Otherwise the new database link https://sairidapublic.blob.core.windows.net/downloads/sistr/database/SISTR_V_1.1_db.tar.gz works OK.

root@ae47ca3822c5:/sistr_cmd# sistr --help
Traceback (most recent call last):
  File "/usr/local/bin/sistr", line 33, in <module>
    sys.exit(load_entry_point('sistr-cmd==1.1.1', 'console_scripts', 'sistr')())
  File "/usr/local/bin/sistr", line 25, in importlib_load_entry_point
    return next(matches).load()
  File "/usr/lib/python3.9/importlib/metadata.py", line 77, in load
    module = import_module(match.group('module'))
  File "/usr/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/usr/local/lib/python3.9/dist-packages/sistr_cmd-1.1.1-py3.9.egg/sistr/sistr_cmd.py", line 15, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/usr/local/lib/python3.9/dist-packages/sistr_cmd-1.1.1-py3.9.egg/sistr/src/cgmlst/__init__.py", line 11, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/usr/local/lib/python3.9/dist-packages/sistr_cmd-1.1.1-py3.9.egg/sistr/src/serovar_prediction/__init__.py", line 5, in <module>
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 627, in _load_backward_compatible
  File "<frozen zipimport>", line 259, in load_module
  File "/usr/local/lib/python3.9/dist-packages/sistr_cmd-1.1.1-py3.9.egg/sistr/src/serovar_prediction/constants.py", line 12, in <module>
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1130, in resource_filename
    return get_provider(package_or_requirement).get_resource_filename(
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1739, in get_resource_filename
    return self._extract_resource(manager, zip_path)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 1761, in _extract_resource
    timestamp, size = self._get_date_and_size(self.zipinfo[zip_path])
KeyError: 'sistr/data'
kbessonov1984 commented 3 years ago

I had added to setup.py a call to setup_sistr_dbs() to init databases before package build and install that solves the issue of direct install from source code. I had tested and it works

apetkau commented 3 years ago

Thanks @kbessonov1984 . I will test out your changes.

I had tested out installing and running myself in a conda environment (prior to your changes) and it all seemed to work good so I am not sure if your specific problem is related to Docker or a specific version of Python. I had started to try to test it out again but hadn't had a chance to finish that.

apetkau commented 3 years ago

Thanks again for testing this out @kbessonov1984 and your suggested changes.

So one issue that exists with the code you have is that if you import sistr.sistr_cmd prior to installing SISTR, you are also importing all the dependencies prior to Python installing the dependencies. This leads to the following error when trying to install with pip install . on the code (assuming . here is the current directory containing the SISTR code):

Processing /home/CSCScience.ca/apetkau/workspace/sistr_cmd_2
  Preparing metadata (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /home/CSCScience.ca/apetkau/miniconda3/envs/sistr/bin/python3.9 -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/setup.py'"'"'; __file__='"'"'/home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-yc6spelk
       cwd: /home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/
  Complete output (9 lines):
  Traceback (most recent call last):
    File "<string>", line 1, in <module>
    File "/home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/setup.py", line 4, in <module>
      from sistr.sistr_cmd import setup_sistr_dbs
    File "/home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/sistr/sistr_cmd.py", line 14, in <module>
      from sistr.src.blast_wrapper import BlastRunner
    File "/home/CSCScience.ca/apetkau/workspace/sistr_cmd_2/sistr/src/blast_wrapper/__init__.py", line 7, in <module>
      import pandas as pd
  ModuleNotFoundError: No module named 'pandas'

If you wanted to have the databases setup prior to packaging you'd have to find some way to trigger the setup_sistr_dbs() function after dependencies have been resolved and SISTR has been installed.

Also, the original issue you described (KeyError: 'sistr/data') is not something I am able to reproduce when I install SISTR in a conda environment. The database downloads just fine as soon as you first run SISTR.

kbessonov1984 commented 3 years ago

@apetkau Had added database pre-install with this new commit and tested install via python setup.py install and pip install ./sistr_cmd routes successfully. The database is downloaded into the source code at sistr/data before package build and install.

kbessonov1984 commented 2 years ago

I had tried to both do installs both via python setup.py install and pip install . routes successfully on a fresh Ubuntu system. I also tried to install them inside a conda environment and both install routes worked

conda create --name sistr python=3.8 pycurl mash blast
conda activate sistr
#install via setup.py
python setup.py install
#install via pip
pip install .

During custom install after the main package is installed, the databases are initialized preventing a race condition such as when a user will launch sistr on various threads at the same time. It is possible to do database init manually by running sistr_init command. This approach solves a local install from the source via pip. Currently the PyPI package comes with databases so no database init is necessary.

apetkau commented 2 years ago

Awesome, thanks so much @kbessonov1984 . This looks great. I tested it out via conda and it is installing properly now. Thanks again.