jts / ncov-tools

Small collection of tools for performing quality control on coronavirus sequencing data and genomes
MIT License

build snpeff database at installation not run time. #78

Open poquirion opened 3 years ago

poquirion commented 3 years ago

Right now snpeff downloads its database at run time, and on top of that it installs it in the CONDA_PREFIX folder. There are at least three cases where this will crash the pipeline.

1- When the pipeline is run on a system with no internet access.
2- When the pipeline is run in a (read-only) container.
3- When conda is installed as one user and the pipeline is run as another user.
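Cases 2 and 3 both come down to the database directory not being writable by the process that runs the pipeline. A minimal sketch of a pre-flight check for that, assuming the database lives somewhere under CONDA_PREFIX (the `share/snpeff/data` sub-path and the function names below are hypothetical, not taken from ncov-tools):

```python
import os
from pathlib import Path


def nearest_existing(path: Path) -> Path:
    """Walk up from `path` until a component that exists on disk is found."""
    path = path.resolve()
    while not path.exists():
        path = path.parent
    return path


def db_dir_is_writable(db_dir) -> bool:
    """Return True if `db_dir` could be created or written by the current user.

    A read-only container or a conda install owned by another user would
    typically make this return False.
    """
    target = nearest_existing(Path(db_dir))
    return target.is_dir() and os.access(target, os.W_OK | os.X_OK)


if __name__ == "__main__":
    # Hypothetical location; the real snpEff data path may differ.
    prefix = os.environ.get("CONDA_PREFIX", ".")
    db_dir = Path(prefix) / "share" / "snpeff" / "data"
    print(f"{db_dir} writable: {db_dir_is_writable(db_dir)}")
```

This does not cover case 1 (no internet access), which only goes away if the database is already present before the run.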

poquirion commented 3 years ago

I do it for the version we run at the Genome Center by adding this line at the end of the Dockerfile:

RUN bash -ic '/app/scripts/build_db.py'

and removing the rules build_snpeff_db and download_db_files from the workflow/rules/annotation.smk file.

rdeborja commented 3 years ago

The snakemake file does have a dependency on CONDA_PREFIX for the database, as mentioned. The goal was to simplify the process, and it was set up with conda in mind.

(1) Yes, if there is no internet access and the snpeff db has not been downloaded beforehand, it will fail.

(2) This would be correct assuming the container was not a copy that already had the snpeff db downloaded.

(3) Does the other user have access to the conda environment, or are they completely isolated? If they are isolated, they will need to perform the download independently.

It seems like you installed snpeff outside conda. Is this correct?

poquirion commented 3 years ago

Question: if the db is already downloaded, will the steps be skipped automatically? If yes, then I will just install the db in the container from the deployment script. This will make our life on the CC system easier.
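Snakemake generally does not rerun a job whose declared outputs already exist and are up to date, so whether the steps are skipped depends on what those two rules declare as outputs. Independently of that, a deployment script could guard the build itself. A minimal sketch, assuming the built database is marked by a `snpEffectPredictor.bin` file under `<data_dir>/<genome>/` (the usual product of a snpEff build; the function name, the data directory, and the genome name used below are illustrative assumptions):

```python
from pathlib import Path


def db_already_built(data_dir, genome, marker="snpEffectPredictor.bin") -> bool:
    """Return True if a non-empty marker file exists for `genome`.

    `marker` is assumed to be the file a snpEff database build produces;
    adjust it if the layout in your installation differs.
    """
    marker_path = Path(data_dir) / genome / marker
    return marker_path.is_file() and marker_path.stat().st_size > 0


if __name__ == "__main__":
    # Hypothetical data directory and genome name, for illustration only.
    if db_already_built("/opt/snpeff/data", "MN908947.3"):
        print("snpEff database found, skipping build")
    else:
        print("snpEff database missing, build required")
```

Requiring a non-empty file is a cheap way to avoid treating an interrupted build as complete.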

As for your question: it is installed in the conda environment, since RUN bash -ic '/app/scripts/build_db.py' will be the last line in the Dockerfile, and bash -ic '' forces the conda env to be loaded.