Install from bioconda but use local database?

schorlton commented 2 years ago

Thanks for the great tool.

As per the title, is it possible to install MOB-suite from bioconda but use a local DB, so that it doesn't have to be built each time I install the package and I can cache the DB locally?

kbessonov1984 commented 2 years ago

Hi, unfortunately the database initialization is critical for the Galaxy wrapper that uses this conda package, since it needs full automation. If you just want a bare install without any database initialization, you can install via pip3 inside a conda environment (pip3 install mob_suite). The database init in the conda package is done by the post-link.sh script, which is not triggered when you install via pip3.

Could you also provide more context? Are you trying to set up the tool inside a virtual machine and mount the database directory?

schorlton commented 2 years ago

Thanks for your quick response!

I actually couldn't find post-link.sh in the repo...are you able to link me?

Sure - I use conda inside docker images for env management. I prefer conda to pip as I can handle my non-python packages, install from bioconda, etc. I frequently break my docker cache by changing an earlier layer and then have to rebuild the MOB-suite database every time. Also, for reproducibility and stability, it's nice to be able to maintain your own copy of a database. If the link to NCBI in the database init script changed, my docker image would break and I'd have no backup.

Thanks!

kbessonov1984 commented 2 years ago

I see ... In that case you would need to create a custom conda package based on the "official one". You can grab the meta.yaml file from https://github.com/bioconda/bioconda-recipes/tree/master/recipes/mob_suite and build a conda package, which produces a file with the tar.bz2 extension. You can then install the tool from that file in your conda environment on the VM (conda install *.tar.bz2).

PS: The post-link.sh file that I was referring to is also located at that link and is run after the conda package is installed.

schorlton commented 2 years ago

Thanks! This is helpful. It seems that post-link.sh just runs mob_init, which downloads the database from Zenodo, builds the BLAST and mash databases, and also downloads the NCBI taxonomy for ete3 to convert into an SQLite DB. Is there a reason for not just hosting a tarball of these 3 databases precompiled on Zenodo? It may actually be beneficial for the NCBI taxonomy database to be in sync with the other databases, as taxonomy IDs change fairly often. It would also avoid building it on each conda install, but there could very well be issues I'm missing. Thanks again!

kbessonov1984 commented 2 years ago

The mob_init script downloads the plasmid and other databases and builds a mash sketch from the plasmid database fasta file. Indeed, the mash sketch could also be included in the Zenodo download, but it does not take a lot of resources or time to build. What takes much more resources is the installation of the ete3 library, which includes downloading a fresh taxonomy file from the NCBI FTP site and building the taxa.sqlite file during install. Since we are not maintainers of the ete3 library used for the host range MOB-Suite module, we cannot change that process.

I think building a custom conda package without the mob_init post-install procedure is a good compromise if you plan to mount the databases folder into your docker images. In most scenarios users would prefer the simplicity of an automatic database install.

In addition, it is possible to build a single Docker image without the MOB-Suite databases initialized (to save on image size) and deploy several containers from that image that mount a single database directory on the host file system. A lite image without databases is available under the 3.0.3_lite tag (docker pull kbessonov/mob_suite:3.0.3_lite). See https://hub.docker.com/repository/docker/kbessonov/mob_suite
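For the mounted-directory setup, a container start-up step could run the initialization only when the mounted directory does not hold the databases yet. The following is a minimal sketch, not part of MOB-suite: the function name ensure_databases is hypothetical, using taxa.sqlite as the "already initialized" marker is an assumption, and so is the default command (it assumes mob_init accepts -d to choose the database directory).

```python
import subprocess
from pathlib import Path


def ensure_databases(db_dir, init_cmd=None, marker="taxa.sqlite"):
    """Run the init command only if the marker file is missing from db_dir.

    Returns True if initialization was run, False if it was skipped.
    """
    db_path = Path(db_dir)
    db_path.mkdir(parents=True, exist_ok=True)

    # Assumption: the presence of the marker file means the databases
    # were already built on the mounted volume.
    if (db_path / marker).is_file():
        return False

    if init_cmd is None:
        # Assumption: mob_init takes -d to set the database directory.
        init_cmd = ["mob_init", "-d", str(db_path)]

    # check=True raises CalledProcessError if the init command fails.
    subprocess.run(init_cmd, check=True)
    return True
```

Calling ensure_databases("/databases") at container start would then build the databases once on the host-mounted directory and skip the step in every later container that mounts the same directory.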

To update the taxonomy of the ete3 package, run the following commands in Python (http://etetoolkit.org/docs/latest/tutorial/tutorial_ncbitaxonomy.html):

from ete3 import NCBITaxa
ncbi = NCBITaxa()
ncbi.update_taxonomy_database()

Hope this helps, have a nice weekend

schorlton commented 2 years ago

Thanks for your response! That sounds like a good solution :+1:

Another thing to consider is that you could actually package the NCBI taxonomy with your MOB-Suite DB. Pretty sure you can specify the DB location to ete3 with: ncbi = NCBITaxa(dbfile="MY_DB.sqlite")

This may remove the bulk of the build time. Also, could it become an issue if the NCBI Taxonomy database drifts far from the MOB-Suite database, so that you would want to keep them version controlled together?
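A minimal sketch of that idea, assuming a prebuilt taxa.sqlite is shipped alongside the other databases and that ete3 opens an existing file passed via dbfile instead of downloading a new one. The helper names find_taxa_db and load_taxonomy are hypothetical, not MOB-suite functions.

```python
from pathlib import Path


def find_taxa_db(db_dir, filename="taxa.sqlite"):
    """Return the path of the packaged taxonomy DB, or fail early if it is absent."""
    path = Path(db_dir) / filename
    if not path.is_file():
        raise FileNotFoundError(f"No taxonomy database at {path}")
    return path


def load_taxonomy(db_dir):
    """Open the packaged taxonomy DB with ete3."""
    # Imported lazily so the path check above can be used without ete3 installed.
    from ete3 import NCBITaxa

    return NCBITaxa(dbfile=str(find_taxa_db(db_dir)))
```

Checking for the file first keeps a missing or mis-mounted database directory from being confused with an ete3 problem.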

nick-youngblut commented 2 years ago

"Since we are not maintainers of the ete3 library used for the host range MOB-Suite module, we can not change that process"

Is there an existing issue posted in the ete3 github repo for this? Defaulting to installing relatively large amounts of files/data in a user's home directory upon install is a recipe for problems. In my case installing mob-suite 3.0.3 filled up my home directory (the admin limits the size & number of files) and resulted in a failed install due to a disk quota error. Many users probably wouldn't understand where the disk quota error comes from.

kbessonov1984 commented 2 years ago

Yes, this ETE3 library issue was resolved by a new PR on July 2, 2021 (link). The latest MOB-Suite version 3.0.3 downloads the taxonomy database into the databases directory of the package install directory rather than into $HOME, as was the case before. The fixed database path is defined in constants.py. We would expand the -d parameter to allow a custom path for the ETE3 taxonomy database (taxa.sqlite).

nick-youngblut commented 2 years ago

I just tried installing bioconda::mob_suite=3.0.3 and it installed files into $HOME/.etetoolkit/ and not into the conda env directory. I'm using Ubuntu 18.04.6
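To check how much space those files take up under a quota-limited home directory, something like the following could be used. It is only a sketch; directory_size is a hypothetical helper and is not part of MOB-suite or ete3.

```python
from pathlib import Path


def directory_size(path):
    """Return the total size in bytes of all regular files below path (0 if absent)."""
    root = Path(path)
    if not root.exists():
        return 0
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())
```

For example, directory_size(Path.home() / ".etetoolkit") reports the space used by the ete3 files in the home directory.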

kbessonov1984 commented 2 years ago

Sorry for the delay.

I tried to replicate the behaviour and checked whether a taxa.sqlite file is also located in $HOME/.etetoolkit/, and indeed a second copy of this file was found there. I can confirm that the ete3 taxonomy database is by default installed BOTH inside the mob-suite package databases folder (e.g., /usr/local/lib/python3.9/dist-packages/mob_suite/databases/taxa.sqlite) and in $HOME/.etetoolkit/, even though the dbfile parameter of the NCBITaxa class is set to the package databases folder (see here). This looks like an issue with the ete3 library. All we can do is remove this residual file as well as the temporary taxdump.tar.gz file (which we are already doing).

The only solution I found is to manually download taxdump.tar.gz from http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz and pass its location when initializing the NCBITaxa object, which expects these two parameters: NCBITaxa(os.path.join(database_directory,"taxa.sqlite"), os.path.join(database_directory,"taxdump.tar.gz")).
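As a rough sketch of that workaround (the helper names taxonomy_paths and build_taxonomy are hypothetical, and the download step assumes the NCBI URL above is reachable):

```python
import os
import urllib.request

# URL quoted in the comment above.
TAXDUMP_URL = "http://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz"


def taxonomy_paths(database_directory):
    """Return the (taxa.sqlite, taxdump.tar.gz) paths inside database_directory."""
    return (
        os.path.join(database_directory, "taxa.sqlite"),
        os.path.join(database_directory, "taxdump.tar.gz"),
    )


def build_taxonomy(database_directory):
    """Download taxdump.tar.gz if needed and build taxa.sqlite from it."""
    # Imported lazily so taxonomy_paths can be used without ete3 installed.
    from ete3 import NCBITaxa

    dbfile, taxdump = taxonomy_paths(database_directory)
    os.makedirs(database_directory, exist_ok=True)
    if not os.path.isfile(taxdump):
        urllib.request.urlretrieve(TAXDUMP_URL, taxdump)

    # Both locations are passed explicitly: dbfile first, then the taxdump file.
    return NCBITaxa(dbfile, taxdump)
```

Keeping the downloaded taxdump.tar.gz next to taxa.sqlite also gives a local copy to rebuild from if the NCBI link ever changes.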

Here is an example log message from the current MOB-suite version 3.0.3, where $HOME=/root/ is the home directory.

2022-05-05 14:11:47,990 mob_suite.utils INFO: Init ete3 library ... [in /usr/local/lib/python3.9/dist-packages/mob_suite/mob_init.py:224]
NCBI database not present yet (first time used?)
Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)...
Done. Parsing...
Loading node names...
2418423 names loaded.
275147 synonyms loaded.
Loading nodes...
2418423 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /root/.etetoolkit/taxa.sqlite ...
 2418000 generating entries...
Uploading to /root/.etetoolkit/taxa.sqlite

Inserting synonyms:      275000
Inserting taxid merges:  65000
Inserting taxids:       2415000
Local taxdump.tar.gz seems up-to-date
Loading node names...
2418423 names loaded.
275147 synonyms loaded.
Loading nodes...
2418423 nodes loaded.
Linking nodes...
Tree is loaded.
Updating database: /usr/local/lib/python3.9/dist-packages/mob_suite/databases/taxa.sqlite ...
 2418000 generating entries...
Uploading to /usr/local/lib/python3.9/dist-packages/mob_suite/databases/taxa.sqlite

jrober84 commented 2 years ago

I am going to close this issue for now, as it originates in ETE3 and is not something we can change within MOB-suite.

phac-nml / mob-suite

Install from bioconda but use local database? #97