smith-chem-wisc / Spritz

Software for RNA-Seq analysis to create sample-specific proteoform databases from RNA-Seq data
https://smith-chem-wisc.github.io/Spritz/
MIT License

Making Spritz cross-platform to cater to the genomics community #123

Closed acesnik closed 4 years ago

acesnik commented 6 years ago

Currently, our focus is fairly narrow: provide a guided tool to facilitate proteogenomics on Windows 10 Anniversary Edition. I'm fine with that narrow vision for my current work. But a bigger vision would be to provide a tool that facilitates all NGS analysis on any platform (Windows, Mac, Linux). Much of the genomics community uses the latter two platforms.

One option for doing this is Docker. Docker is both a company and a program that allows "containerization" of workflows. One chooses a base system (e.g. Ubuntu 14.04) and adds libraries on top of it; Docker containers can then build on those libraries and add new tools. Ideally, this would be great. It would alleviate the large overhead of downloading all the tools up front when the program is first installed. It would also be available on any system.

There are two ways this could work. Currently, we have encapsulated scripts for WSL within a GUI. To make this a cross-platform application, we could:

1) Incorporate Docker and/or bash scripts within a cross-platform (C++? Java?) GUI.
2) Use Docker to deploy an interactive application that can be accessed from a browser.
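For the first option, the GUI would need some way to decide whether to drive Docker or fall back to the existing bash scripts. Here is a minimal Python sketch of that decision, assuming it is enough to check which executables are on PATH. The function name choose_backend and the "docker"/"bash" labels are hypothetical and not part of Spritz.

```python
import shutil
from typing import Callable, Optional


def choose_backend(which: Callable[[str], Optional[str]] = shutil.which) -> str:
    """Pick a workflow backend based on which executables are on PATH.

    `which` is injectable so the decision can be exercised without
    Docker or bash actually being installed.
    """
    if which("docker"):
        return "docker"  # prefer containerized workflows when available
    if which("bash"):
        return "bash"    # fall back to the existing shell scripts
    raise RuntimeError("neither docker nor bash was found on PATH")


if __name__ == "__main__":
    try:
        print("backend:", choose_backend())
    except RuntimeError as err:
        print("no backend:", err)
```

A real GUI would probably also want to confirm that the Docker daemon is running, which this sketch does not attempt.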

acesnik commented 6 years ago

This is relevant. It does look like Docker containers have a ~10% hit in performance compared to using the native system. https://forums.docker.com/t/hyper-v-or-native-windows-containers/44631

acesnik commented 6 years ago

I gave Docker a try tonight, since it was fresh on my mind.

I think Docker would be very powerful for keeping Spritz lightweight (downloading only what is needed at the time) and stable (doesn't break if some library gets updated in a weird way, since the binaries could all be stored in a tested image).

It would be a major overhaul, though. Going from my bash scripting to Dockerfiles isn't particularly easy. The installation scripting I have written so far would need to be pulled apart into light installations for each individual workflow.

In terms of how Docker containers work, I think it would also be a bit of a challenge to figure out how best to schlep files out of the containers and into the output directory.

A bit about Docker:

Docker is a company that focuses on "containerizing" workflows.

Dockerfile is the flat file that's used to build a Docker image. Here's an example Dockerfile.

Docker image is what you get after building the Dockerfile. It contains all of the binaries produced by the build, which is an improvement over relying on the installation commands to work every time, as I currently do in Spritz.

Docker container is the most confusing one. It turns out you "run" Docker images, and what you get out the other side is a Docker container. An example of this is docker run -it -d --name=example -e var1=variable acesnik/dockertest. Important here is that -d runs the container detached (in the background), which in my test left the container available after the command finished (downloading an SRA archive in this case), so that the files could be transferred out at the end.

Docker repository is the Docker website where the image for a Dockerfile is built and stored in the cloud. This is actually really slick. It links up with GitHub, then pulls and tries to build the repository after every push to the GitHub repository. There are features for testing that the resulting image works.
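The Dockerfile, image, container sequence described above can be sketched as a series of docker CLI calls. The following is a minimal Python sketch rather than anything Spritz does today: it only assembles argument lists for the standard docker build, docker run, docker cp, and docker rm subcommands and prints them. The helper names and the /data/output and ./output paths are hypothetical; the image and container names are taken from the example command above.

```python
import shlex
from typing import Dict, List


def build_image(tag: str, context_dir: str) -> List[str]:
    """Dockerfile -> image: build the Dockerfile found in context_dir."""
    return ["docker", "build", "-t", tag, context_dir]


def run_container(image: str, name: str, env: Dict[str, str]) -> List[str]:
    """Image -> container: start a named container in the background."""
    cmd = ["docker", "run", "-it", "-d", f"--name={name}"]
    for key, value in env.items():
        cmd += ["-e", f"{key}={value}"]  # one -e flag per environment variable
    return cmd + [image]


def copy_out(name: str, container_path: str, host_path: str) -> List[str]:
    """Copy files from the container's filesystem to the host."""
    return ["docker", "cp", f"{name}:{container_path}", host_path]


def remove_container(name: str) -> List[str]:
    """Remove the container once its output has been copied out."""
    return ["docker", "rm", "-f", name]


if __name__ == "__main__":
    steps = [
        build_image("acesnik/dockertest", "."),
        run_container("acesnik/dockertest", "example", {"var1": "variable"}),
        copy_out("example", "/data/output", "./output"),  # hypothetical paths
        remove_container("example"),
    ]
    for step in steps:
        # Printed only; pass each list to subprocess.run to actually execute.
        print(shlex.join(step))
```

Keeping each step as an argument list rather than a single string avoids shell quoting problems if the commands are later handed to subprocess.run.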

acesnik commented 6 years ago

The -v option is really helpful. Here's an example command that mounts the folder E:\data to the /mnt folder in the container. Files created there are stored on the hard drive and are accessible outside the Docker environment.

docker run --rm -i -t -v e:/data:/mnt lethalfang/scalpel:0.5.4
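As a rough sketch of how the -v argument in that command is put together, the Python helpers below build the host:container mount string and the full docker run argument list. The function names are hypothetical, and converting the Windows path to forward slashes simply mirrors the e:/data form used in the command above; it is an assumption, not a documented requirement.

```python
from pathlib import PureWindowsPath
from typing import List


def mount_arg(host_dir: str, container_dir: str) -> str:
    """Build the value for -v: <host path>:<container path>."""
    # as_posix() rewrites backslashes as forward slashes and keeps the
    # drive letter, and PureWindowsPath works on any operating system.
    host = PureWindowsPath(host_dir).as_posix()
    return f"{host}:{container_dir}"


def run_with_mount(image: str, host_dir: str, container_dir: str) -> List[str]:
    """Assemble a docker run command that bind-mounts host_dir."""
    return [
        "docker", "run", "--rm", "-i", "-t",
        "-v", mount_arg(host_dir, container_dir),
        image,
    ]


if __name__ == "__main__":
    cmd = run_with_mount("lethalfang/scalpel:0.5.4", r"E:\data", "/mnt")
    print(" ".join(cmd))  # printed only, not executed
```

Because the mounted folder lives on the host, anything a workflow writes under /mnt would remain available after the --rm flag removes the container.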