ncbi / sra-tools

Different read/write cache locations: docker containers #41

Open evanfloden opened 8 years ago

evanfloden commented 8 years ago

I am experimenting with running prefetch within Docker containers that are executed across multiple HPC environments.

Due to permissions issues across the shared file system and the Docker user setup, ideally I would like to be able to check from within the container whether an SRR and its requirements are present in the cache location (which lives outside the container and is mounted) and, if they are not, write them to a cache location inside the container.

I have currently set up the first part but cannot understand how to have a different write location using vdb-config. Any insight would be much appreciated.

kwrodarmer commented 8 years ago

You should probably set up two levels of cache: one that we call the "site" configuration, generally populated by some centralized authority and therefore considered read-only to the client, and the other the client's personal "user" cache location, which is the one enabled by default. If this sounds like something that would address your issue, it is easy to do; however, we have not yet published a how-to for setting up a site repository.

Within NCBI, most users have their user-cache turned off, and utilize our "site" repository, which is the public SRA itself. Some large installations have set up similar cases, where they have their own cache of popular or active runs maintained by a systems group.

See the description of name resolution at https://github.com/ncbi/ncbi-vdb/wiki/Name-Resolution-Process and let us know if this sounds like something that would address your issue. Meanwhile, we'll try to prepare a quick Wiki page today to describe the steps for setting up a site repository.

evanfloden commented 8 years ago

Thanks for the quick reply. I was thinking this exact thing after I posted the question. I will post my Dockerfile here for others to use as some instructions when it is complete. The basic idea is a read-only 'site' repo which lives outside the Docker container and is mounted. Then a 'user' repo is set up and populated at run time. After the execution, the two can be merged outside the container environment.
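The lookup order described above (read-only site repo first, then the user repo, then a download) can be sketched in shell. This is only an illustration and not how the toolkit resolves names internally: `find_cached_run` is a hypothetical helper, and it assumes runs are stored as `<repo_root>/sra/<accession>.sra`, which may not match every installation. It also ignores any reference sequences a run depends on.

```shell
# Hypothetical helper: print the path of a cached run, or fail.
# Assumption: runs are stored as <repo_root>/sra/<accession>.sra.
find_cached_run() {
    accession=$1
    shift
    # Check each repository root in the order given (site first, then user).
    for repo_root in "$@"; do
        candidate="$repo_root/sra/$accession.sra"
        if [ -f "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Usage inside the container (the user repo path is a placeholder):
# find_cached_run "$sra_id" /ncbi_site /home/sra_user/ncbi/public \
#     || prefetch "$sra_id"
```

Once both repositories are configured, prefetch and the other tools should perform this resolution themselves, so a wrapper like this is mainly useful for checking what is already cached.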

kwrodarmer commented 8 years ago

Yes, exactly. One of the ways we envisioned people building a site repository was to periodically gather users' repositories into a common location.
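A minimal sketch of such a gathering step, run outside the container with write access to the site repository, might look like the following. `merge_user_into_site` is a hypothetical helper, the `sra/<accession>.sra` layout is an assumption, and a real merge would also need to handle reference sequences and any other files the runs depend on.

```shell
# Hypothetical helper: copy runs from a user repository into a site
# repository, skipping any run the site repository already holds.
# Assumption: both repositories keep runs under <root>/sra/*.sra.
merge_user_into_site() {
    user_root=$1
    site_root=$2
    mkdir -p "$site_root/sra" || return 1
    for run in "$user_root"/sra/*.sra; do
        [ -f "$run" ] || continue    # the glob matched nothing
        target="$site_root/sra/$(basename "$run")"
        if [ ! -e "$target" ]; then
            cp "$run" "$target" || return 1
        fi
    done
}

# Usage (both paths are placeholders):
# merge_user_into_site /scratch/job42/user_repo /shared/ncbi_site
```

Existing files in the site repository are never overwritten, so the step can be repeated safely after each batch of jobs.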

kwrodarmer commented 8 years ago

I should also mention that we are now looking at producing a Docker container for the SRA Toolkit, along with some third party tools such as SRA-aware GATK and Hisat2, plus other software that is useful for processing SRA data.

evanfloden commented 8 years ago

Yeah, that sounds promising. If it helps, here is my first attempt at a Dockerfile that uses the above technique and appears to be functional at my end.

It includes HISAT2, Stringtie, SAMTools, Ballgown, aspera, sra-tools and ncbi-vdb plus all dependencies.

The vdb gets configured to have 2 repos (local/user and site). The site repo can then be mounted by Docker using the -v option:

    docker run -v <your_ncbi_repo>:/ncbi_site <container_id>
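Since the site repo is meant to be read-only from inside the container, the mount itself can enforce that with Docker's `:ro` suffix. The helper below is a hypothetical convenience, not part of the Dockerfile; it resolves the host directory to an absolute path (a bare name would otherwise be interpreted by Docker as a named volume) and prints the argument for `-v`.

```shell
# Hypothetical helper: build the docker -v argument for a read-only
# site mount. /ncbi_site is the mount point used in the command above.
site_mount_arg() {
    host_repo=$1
    # Resolve to an absolute path; fail if the directory does not exist.
    abs_repo=$(CDPATH= cd -- "$host_repo" && pwd) || return 1
    printf '%s:/ncbi_site:ro\n' "$abs_repo"
}

# Usage (./ncbi_repo is a placeholder for your site repo directory):
# docker run -v "$(site_mount_arg ./ncbi_repo)" <container_id>
```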

I use Nextflow to tie the pipeline together. The Dockerfile above is for developing a pipeline based on the recent Nature Protocols publication here; however, we will be looking to create a more integrated solution for accessing SRA/Seq data with Nextflow and container technology.

evanfloden commented 8 years ago

It should also be noted that from within the container the following would be executed to use prefetch with aspera.

    prefetch -a "/home/sra_user/.aspera/connect/bin/ascp|/home/sra_user/.aspera/connect/etc/asperaweb_id_dsa.openssh" -t fasp ${sra_id}

klymenko commented 3 years ago

Here is the information about SRA tools docker: https://github.com/ncbi/sra-tools/wiki/SRA-tools-docker