mskcc / cmo

Command-line tools for data analysts at the CMO
GNU General Public License v2.0

Use conda to manage deps #8

Closed inodb closed 5 years ago

inodb commented 8 years ago

Not sure what way you are installing all the dependencies atm, but have you considered using anaconda and launching jobs with a specific conda environment? We have had some success with that approach in our pipeline and it def made switching clusters much easier (https://github.com/jrflab/modules/wiki/Conda-Environment-Notes).

jchodera commented 8 years ago

Our lab has set up an automated conda packaging and distribution pipeline for our omnia software consortium, so we have a decent amount of experience if you have questions about this route:

http://omnia.md

lordzappo commented 7 years ago

Sorry to let this sit for so long. The long-term intent here was always to use Docker for packaging tools, but security concerns have delayed adoption on the luna cluster.

We will be pursuing Singularity as a Docker alternative on luna for portability, in tandem with CWL for portable workflows. It still needs to be proven to play well with CWL, which has specific Docker hooks for some things, but we already have it installed on demo nodes and are actively evaluating it for this purpose.

Our goal is to have all versions of everything live for all users at all times, so conda doesn't fit that paradigm well.

This package and fireworks will remain live for prototyping and keeping in-house things as easy as possible, but production pipelines will be moving to CWL after its feasibility is proven. This package may or may not receive CWL writing support as our plans become more firm.

inodb commented 7 years ago

@lordzappo Thanks for the update! I like the idea of going for CWL, mostly since it has much broader support in the bioinformatics community.

As a side note, you can still have versions of everything live for all users at all times using conda. You just need to create a separate virtual environment for each version. Without that kind of isolation, sooner or later you run into one program requiring a different version of another program. You can certainly handle this by manipulating PATH variables, but it's a lot less structured imo.
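A minimal sketch of the one-environment-per-version idea, written in Python so that it only builds the commands rather than running them. The `tool-version` naming scheme, the helper names, and the two bwa versions are assumptions for illustration, not something this package does; the `conda create` flags used (`--name`, `--channel`, `--yes`) are standard ones.

```python
import shlex


def env_name(tool, version):
    """Hypothetical naming scheme: one environment per tool version."""
    return f"{tool}-{version}"


def conda_create_command(tool, version, channel="bioconda"):
    """Build (but do not run) a `conda create` command pinning one version."""
    return [
        "conda", "create", "--yes",
        "--name", env_name(tool, version),
        "--channel", channel,
        f"{tool}={version}",  # conda's package=version pin syntax
    ]


if __name__ == "__main__":
    # Example versions only; check the channel for what is actually available.
    for version in ("0.7.12", "0.7.15"):
        print(shlex.join(conda_create_command("bwa", version)))
```

With environments named this way, the version is visible in the environment name even though the binary inside is still called `bwa`.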

Docker can solve this problem as well, and you could still use anaconda within Docker. The main issue with packaging software is that you need community support. Otherwise you end up having to package everything yourself. Bioconda already has a ton of packages and quite a large following: https://github.com/bioconda/bioconda-recipes. There is BioContainers for Docker, but the support is way smaller: https://github.com/BioContainers/containers. If someone set up a way to create Docker containers from bioconda packages, it would make a ton of bioinf packages readily available.

lordzappo commented 7 years ago

We're not that keen on bioconda for a number of reasons, but perhaps some of them are misconceptions on our part.

First, these pipelines need to be accessible by more than one user, and transferring a YAML file at the command line is not an acceptable solution. There needs to be a potentially unlimited number of versions of pipelines available for all users at all times, without any configuration on their part.

Second is clarity. If a binary such as bwa is updated in the conda model, my understanding is that calls to either version would appear as "bwa", whereas in our version-aware model you see the version in the command line. It seems like this ends in either naming the environments very verbosely or always dumping the bioconda stack into a file per pipeline run. I guess that's not terrible, but having the version in the command line is probably better.
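To make "version in the command line" concrete, here is a rough sketch of one way it could look. The install layout under `/opt/common`, the `TOOL_PATHS` table, and the `versioned_command` helper are all hypothetical and are not taken from this package.

```python
# Hypothetical install layout: one directory per tool version.
TOOL_PATHS = {
    ("bwa", "0.7.12"): "/opt/common/bwa/0.7.12/bwa",
    ("bwa", "0.7.15"): "/opt/common/bwa/0.7.15/bwa",
}


def versioned_command(tool, version, args):
    """Return a command whose first element names the exact versioned binary."""
    try:
        binary = TOOL_PATHS[(tool, version)]
    except KeyError:
        raise ValueError(f"{tool} {version} is not installed") from None
    return [binary, *args]


if __name__ == "__main__":
    cmd = versioned_command("bwa", "0.7.15", ["mem", "ref.fa", "reads.fq"])
    print(" ".join(cmd))
    # prints: /opt/common/bwa/0.7.15/bwa mem ref.fa reads.fq
```

Because the full path is what gets logged, the version used is recoverable from the command line alone, without a separate environment dump.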

Third, we're interested in supporting command-line wrappers for all tools that make up a pipeline, and this would seem to get really confusing with the environment model. For example, if I want to run a specific version of bwa via a command-line wrapper, I think I first have to find a matching environment or create a new one. Relatedly, our command-line tools are aware of FASTA locations, GTF locations, etc., hiding those details from the user, and getting those to install and work inside a bioconda environment sounds more complex than, or at least no easier than, what we are already doing.
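As an illustration of a wrapper that hides reference locations from the user, a sketch along these lines is one possibility. The `GENOMES` table, the paths in it, and both helper names are made up for the example; only the `bwa mem <reference> <reads1> <reads2>` argument order reflects the real tool.

```python
# Hypothetical registry of reference files, keyed by genome build name.
GENOMES = {
    "GRCh37": {
        "fasta": "/refs/GRCh37/genome.fasta",
        "gtf": "/refs/GRCh37/genes.gtf",
    },
}


def reference_path(build, kind):
    """Look up a reference file (e.g. 'fasta' or 'gtf') for a genome build."""
    try:
        return GENOMES[build][kind]
    except KeyError:
        raise ValueError(f"no {kind} registered for build {build!r}") from None


def bwa_mem_command(bwa_binary, build, fastq1, fastq2):
    """Build a `bwa mem` command; the user names a build, not a FASTA path."""
    return [bwa_binary, "mem", reference_path(build, "fasta"), fastq1, fastq2]


if __name__ == "__main__":
    print(" ".join(bwa_mem_command("bwa", "GRCh37", "r1.fastq", "r2.fastq")))
```

The user only passes a build name such as `GRCh37`; the wrapper fills in the file location, which is the detail the comment above says should stay hidden.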

Finally, last time I looked at bioconda, it only had one or two versions available of many common programs, whereas I currently have five different versions of samtools installed on luna. So it doesn't really solve the problem of getting binaries for us, since it is missing things we require, and in any case getting binaries is not something I do very often or a real pain point of this process.

I do agree that packaging software is non-trivial, but the good news here is that we can start life with less portable CWL wrappers and bare binaries and turn them into Docker containers as need and time allow, without paying any particular penalty for waiting. We are also hiring staff to work only on this, so doing all of it ourselves is a viable option.

The bioshadock resource seems to have about 24,000 packages and indexes the BioContainers, so it's roughly an order of magnitude larger than BioContainers itself: https://docker-ui.genouest.org/app/#/all/containers

From what I've seen, bioconda sounds like a good solution for a single-user pipeline that is usually applied to a whole project at one time. I think our use case is quite different, though: we have multiple users, need to deliver consistent results using past pipelines while delivering new projects on the newest pipelines, and need to deliver everything at the command line as standalone tools.

(Sent by email in reply to Ino de Bruijn's comment of Tue, Dec 13, 2016, 9:01 PM, which appears in full above.)

ckandoth commented 5 years ago

Closing cuz of all reasons listed above against using conda.