phac-nml / mob-suite

MOB-suite: Software tools for clustering, reconstruction and typing of plasmids from draft assemblies
Apache License 2.0

taxadump location ? #38

Closed EricDeveaud closed 4 years ago

EricDeveaud commented 5 years ago

Hello,

On our cluster, the compute nodes do not have network access to the outside. After generating a Singularity container with https://github.com/phac-nml/mob-suite/blob/master/mob_suite/singularity/recipe.singularity, we have the following problem:

2019-11-05 14:31:52,788 DEBUG: Found 79 records (with duplicates) in the reference database.(37 unique and 42 duplicated) [in /opt/miniconda/envs/py36/lib/python3.6/site-packages/mob_suite/mob_host_range.py:456]

NCBI database not present yet (first time used?)

Downloading taxdump.tar.gz from NCBI FTP site (via HTTP)...

Traceback (most recent call last):

[SNIP long traceback]

  File "/opt/miniconda/envs/py36/lib/python3.6/socket.py", line 713, in create_connection

    sock.connect(sa)

OSError: [Errno 101] Network is unreachable

I tried to dig into the problem and noticed that mob_init does not install taxdump.tar.gz. Here is the content of the databases directory:

module load singularity
singularity shell mob-suite-2.0.1.simg
Singularity mob-suite-2.0.1.simg:~/eric> ls /opt/miniconda/envs/py36/lib/python3.6/site-packages/mob_suite/databases/
__init__.py                mpf.proteins.faa        rep.dna.fas
__pycache__                ncbi_plasmid_full_seqs.fas      repetitive.dna.fas
host_range_literature_plasmidDB.csv    ncbi_plasmid_full_seqs.fas.msh  repetitive.dna.fas.nhr
host_range_ncbirefseq_plasmidDB.csv    ncbi_plasmid_full_seqs.fas.nhr  repetitive.dna.fas.nin
literature_mined_plasmid_seq_db.fasta      ncbi_plasmid_full_seqs.fas.nin  repetitive.dna.fas.nsq
literature_mined_plasmid_seq_db.fasta.msh  ncbi_plasmid_full_seqs.fas.nsq  status.txt
mob.proteins.faa               orit.fas

While building the image, we see that taxdump.tar.gz is downloaded to /:

Singularity mob-suite-2.0.1.simg:/> ls /taxdump.tar.gz 
/taxdump.tar.gz

What is the correct location for taxdump.tar.gz?

kbessonov1984 commented 5 years ago

The ete3 package initializes its databases in the ~/.etetoolkit/ directory by default. Just copy all files from that directory during the container build and it should work. This would allow mob_hostrange to work properly and handle taxonomy ID conversions. More on this issue can be found at https://github.com/etetoolkit/ete/issues/295

EricDeveaud commented 5 years ago

Thanks, but IMHO this is not a suitable solution for a container. If I create the container, I will have the ~/.etetoolkit/ directory, but if someone else runs the container, they will experience the same error. It would be more suitable to "embed" taxa.sqlite in the container.

Is there a way to specify a location for taxa.sqlite, via an environment variable or something else?
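For context on the question above: ete3's `NCBITaxa` class accepts a `dbfile` argument, so a tool built on it could in principle be pointed at a shared taxa.sqlite. The sketch below is only an illustration of that idea, not how mob-suite works. The environment variable name `TAXA_SQLITE` and both helper functions are hypothetical.

```python
import os

# Hypothetical variable name, used only for this illustration.
ENV_VAR = "TAXA_SQLITE"


def resolve_taxa_db(environ=None):
    """Return the taxa.sqlite path to use: the env var if set, else ete3's default."""
    environ = os.environ if environ is None else environ
    default = os.path.join(os.path.expanduser("~"), ".etetoolkit", "taxa.sqlite")
    return environ.get(ENV_VAR) or default


def open_taxonomy(dbfile):
    """Open the NCBI taxonomy from an explicit file (requires ete3)."""
    # Imported lazily so that resolve_taxa_db() works without ete3 installed.
    from ete3 import NCBITaxa

    # Caveat: if dbfile does not exist, ete3 will still try to download
    # taxdump.tar.gz, which fails on nodes without network access.
    return NCBITaxa(dbfile=dbfile)
```

With something like this, `open_taxonomy(resolve_taxa_db())` would read the shared file when the variable is set and fall back to the per-user location otherwise.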

kbessonov1984 commented 5 years ago

Would you prefer to see taxa.sqlite inside the databases directory, for easier container construction when there is no Internet access? As far as I understand, mob_init still needs to be run on a machine with Internet access, and the files copied into the container during the image build. What would be the ideal solution for you? I am just trying to understand the context.

EricDeveaud commented 5 years ago

Yes, the container build is done on a computer with internet access and mob_init is run at container build time, so there's no problem downloading the files.

On our cluster, the compute nodes do not have access to the internet, so mob_init and other tools requiring network access are prone to failure.

Hosting the taxa db in the container won't be a solution either, as it needs to be writable and the container is not.

Ideally, there would be an option on the mob_* tools that allows the user to specify the taxa.sqlite location to use. This way we could (as admins) host and manage the taxa.sqlite file installation and updates.

I hope that gives you a better 'aperçu' of the situation.

EricDeveaud commented 5 years ago

As a workaround, I currently use a wrapper that tells the user that

before running the mob-suite tools you need to copy
 /some/shared/path/taxa.sqlite to ${HOME}/.etetoolkit/taxa.sqlite

if ${HOME}/.etetoolkit/taxa.sqlite does not exist.
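The copy step that the wrapper asks for could also be automated. The following is a minimal sketch of that idea, not the actual wrapper; the function name `ensure_taxa_db` is made up for illustration and the shared path is a placeholder.

```python
import os
import shutil


def ensure_taxa_db(shared_db, home=None):
    """Copy shared_db to <home>/.etetoolkit/taxa.sqlite unless it already exists."""
    home = home or os.path.expanduser("~")
    target_dir = os.path.join(home, ".etetoolkit")
    target = os.path.join(target_dir, "taxa.sqlite")
    if not os.path.exists(target):
        # Create the per-user directory on first use, then copy the shared file.
        os.makedirs(target_dir, exist_ok=True)
        shutil.copy2(shared_db, target)
    return target
```

Calling `ensure_taxa_db("/some/shared/path/taxa.sqlite")` before launching a mob-suite tool leaves an existing per-user copy untouched, so a user's local database is never overwritten.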

kbessonov1984 commented 5 years ago

Thank you for the additional details. I think that if the mob_suite database directory is mountable for a Singularity container, then the database flag (-d) can be used to point to that mounted, writable location where all database files (mash sketches, FASTA sequences, and all ete3 database files including taxa.sqlite) are stored. That would be an ideal, flexible solution: you can initialize the database folder once and make it usable across any container. We can implement this in the next release fairly soon. Thank you for the new use case. Perhaps you would be able to share your new Singularity recipe? Are you using Linux to build the container?
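To make the shared-folder proposal concrete, here is a rough sketch of how a launcher might assemble such a call. It relies on Singularity's `--bind` option and the `-d` flag mentioned above. The helper name, the paths in the usage note, and any other tool arguments are assumptions for illustration only.

```python
def build_singularity_command(image, host_db_dir, container_db_dir, tool_args):
    """Return an argv list that runs a tool in the image with the database
    directory bind-mounted and passed to the tool via -d."""
    bind = f"{host_db_dir}:{container_db_dir}"
    return (
        ["singularity", "exec", "--bind", bind, image]
        + list(tool_args)
        + ["-d", container_db_dir]
    )
```

For example, `build_singularity_command("mob-suite-2.0.1.simg", "/shared/mob_db", "/mnt/mob_db", ["mob_typer"])` yields a list that can be handed to `subprocess.run`. The host folder only needs to be initialized once and can then be reused by any container.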

EricDeveaud commented 5 years ago

Great idea for the "shared folder", as it will allow updates without rebuilding the container.

Yes, I am building the containers on Linux, and I will provide an updated recipe.

kbessonov1984 commented 5 years ago

I have finally moved all ete3 taxonomy database dependencies (taxa.sqlite) to the site-packages/mob_suite/databases/ tool folder so that it can easily be mounted and shared outside the container. I like this idea, as it keeps all databases in a single location. This allows running Singularity containers without an Internet connection and updating the databases easily as needed.

This functionality is available from version 2.0.2.

kbessonov1984 commented 5 years ago

It was discovered that the -d parameter, which specifies a custom databases directory, is not working in the mob_hostrange module. As a result, the ete3 taxonomy database file taxa.sqlite is initialized in the default mob_suite package path (../site-packages/mob_suite/databases/). I would like to keep all database files in a single location and allow the mob_hostrange module to accept a custom location for taxa.sqlite passed via the -d parameter, both from mob_hostrange and from other modules (mob_typer and mob_recon). This is especially relevant for read-only containers (e.g. Singularity), which do not allow write operations inside the container and so force users to mount an already initialized databases folder over the default location (.../lib/python3.6/site-packages/mob_suite/databases).
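The intended behaviour can be pictured as every module deriving the taxa.sqlite path from the same -d value. The sketch below illustrates that idea with argparse and is not the mob-suite source. The long option name, the default directory placeholder, and the function names are all assumptions.

```python
import argparse
import os

# Placeholder for the package's default databases folder, not the real path.
DEFAULT_DB_DIR = "/path/to/mob_suite/databases"


def build_parser():
    """Build a parser exposing the shared -d option (long name is assumed)."""
    parser = argparse.ArgumentParser(description="Sketch of a shared -d option")
    parser.add_argument(
        "-d",
        "--database_directory",
        default=DEFAULT_DB_DIR,
        help="Directory holding all database files, including taxa.sqlite",
    )
    return parser


def taxa_db_path(args):
    """Derive the taxa.sqlite location from the chosen databases directory."""
    return os.path.join(args.database_directory, "taxa.sqlite")
```

If each module resolved the file this way, passing `-d /mnt/mob_db` would make it look for `/mnt/mob_db/taxa.sqlite` instead of writing into the read-only package path.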