ncbi / sra-tools

SRA Tools

[feature request]: Document non-interactive cloud setup for sra toolkit #282

Closed seandavi closed 4 years ago

seandavi commented 4 years ago

It is quite possible that I am missing some docs somewhere, but one common use case is to provide a user script (or Dockerfile) to create a reproducible environment for cloud work. The documentation I found only covers running vdb-config in interactive mode. Would it be possible to provide documentation that is more scriptable?

kwrodarmer commented 4 years ago

Hi @seandavi - we are moving toward using Docker as our delivery mechanism. With 2.10.2 we introduced a step that requires one-time human intervention unless this has been dealt with at an admin level. Are you having issues on Biowulf?

seandavi commented 4 years ago

No, this is really for cloud deployment. There are a lot of options, but the best approaches involve using code to describe infrastructure and environment. In a cloud environment, I can think of a couple of things that would be relevant: 1) using service account credentials and 2) providing a mechanism to configure sra-toolkit without needing to do the setup interactively. I know number 2 is possible by copying config files. The first seems a bit unclear at this point. You talk about specifying key files; that shouldn't be needed in a cloud environment if you are using cloud SDKs. Probably just documentation.

No urgency. Just wanted to capture my thought.

kwrodarmer commented 4 years ago

See my comment in #291 as well.

With cloud usage come a number of issues that never existed before, primarily related to access and billing. Controlled-access data have been available for a long time in dbGaP, but access was always free of charge. Current cloud-provider support is limited and forces us to operate under some very new and surprising conditions. The SRA Toolkit has had to contort in many ways to try to preserve the previous operating paradigm, but there are some things that are currently beyond our control.

The requirement that configuration be interactive is tied directly to user consent (which is not automated) and, in related cases, to coherency and potential race conditions.

seandavi commented 4 years ago

Thanks, @kwrodarmer. I'm glad you mentioned the consent issue; it is worth publicizing that aspect.

mvdbeek commented 4 years ago

Would it be possible to abandon the config file altogether or to make it optional? This would make shipping reproducible pipelines considerably easier and would avoid any kind of consent issue altogether (since you'd have to use the relevant command-line switch).

kwrodarmer commented 4 years ago

Hi @mvdbeek - the config step is now more necessary than ever before.

mvdbeek commented 4 years ago

Can you elaborate on why that is so?

If billing issues arise for certain subsets of operations, I would naively think you could require either an up-to-date config file (as you do now) for that particular operation or an additional command-line argument (or environment variable). This way you wouldn't force projects like Galaxy to ship a default config (which seems to me like what we would need to do now).

kwrodarmer commented 4 years ago

There's always a default config! We anticipate the tools operating concurrently with shared config files, which creates a race condition whenever any configuration changes are needed.

There are pieces being developed to fill in gaps. I am sorry this is not an instantaneous process, but for now what you have is what we can offer. I would expect it to change going forward.

mvdbeek commented 4 years ago

Right, but it appears the default config is now not sufficient to do anything until

vdb-config --interactive

is run.

In the case of Galaxy there are no concurrency issues: every execution gets its own config file (by means of an isolated home directory and vdb-config --restore-defaults). That no longer seems to work. If you could restore that at some point, that'd be great.
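
The per-execution isolation described here could be sketched roughly as follows. This is a minimal sketch, not Galaxy's actual implementation: with_private_home is a hypothetical helper name, and the only sra-tools command assumed is vdb-config --restore-defaults, as named above. Whether that command alone is still sufficient on 2.10.2 and later is exactly what this thread questions.

```shell
#!/usr/bin/env bash
# Sketch: run one command with a private, throwaway HOME so that any
# per-user config it reads or writes cannot be shared with other jobs.
# with_private_home is a hypothetical name, not part of sra-tools.
with_private_home() {
    local job_home status

    # Fresh, empty directory that only this job will use as HOME.
    job_home="$(mktemp -d)" || return 1

    # The subshell keeps the HOME override from leaking into the caller.
    (
        HOME="$job_home"
        export HOME
        "$@"
    )
    status=$?

    # Discard the private HOME (and any config written into it).
    rm -rf -- "$job_home"
    return "$status"
}
```

A job could then be wrapped as with_private_home sh -c 'vdb-config --restore-defaults && <your command>', so that the config is created and used inside the same private HOME before it is discarded.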

kwrodarmer commented 4 years ago

@mvdbeek - you are generally correct about the need to run vdb-config once beforehand. This is part of what I mentioned we are trying to avoid, but for today it is a fact of life that results from conditions beyond the control of this team or project. Please understand that we are working on alleviating it as soon as possible.

As far as "In the case of Galaxy there are no concurrency issues", this can be said about any customer's conditions. We try to address all of the needs we can with a single product. To address each customer individually is not possible for our team.

s-andrews commented 4 years ago

This comes from another report, in case people come across this before the new infrastructure is put into place. There is a workaround in the Docker config file that is now shipped, and it will work elsewhere too. It is simply to run:

printf '/LIBS/GUID = "%s"\n' `uuidgen` > /root/.ncbi/user-settings.mkfg

kwrodarmer commented 4 years ago

@s-andrews - there is a reason we did not publish this before. Your recommendation is likely to cause the exact race condition we were trying to avoid, depending upon how it is employed.

s-andrews commented 4 years ago

You did publish it before - it's written in your docker file which you pointed people to. The only race you could get would be between running this and starting a program which would use it. Nothing is conceptually different about that from running vdb-config -i other than it slows them down a bit. Heck - add a sleep statement on the end if that's the main concern.
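
For the write race discussed above, one possible way to make the workaround safer is to create the settings file only if it is absent, and to publish it atomically. The following is a sketch under assumptions: ensure_sra_guid and new_uuid are hypothetical helper names, the /LIBS/GUID line and the .ncbi/user-settings.mkfg path are taken from the command quoted earlier, and the /proc fallback is Linux-only. It addresses only concurrent writers, not the consent question raised earlier in the thread.

```shell
#!/usr/bin/env bash
# Sketch: write user-settings.mkfg once, without clobbering a file that
# another process created first. Helper names are hypothetical.

new_uuid() {
    if command -v uuidgen >/dev/null 2>&1; then
        uuidgen
    else
        # Assumption: Linux kernel interface, used only when uuidgen is missing.
        cat /proc/sys/kernel/random/uuid
    fi
}

ensure_sra_guid() {
    local cfg_dir="${1:-$HOME/.ncbi}"
    local cfg="$cfg_dir/user-settings.mkfg"
    local tmp

    mkdir -p "$cfg_dir" || return 1

    # Fast path: someone (possibly us, earlier) already wrote the file.
    [ -e "$cfg" ] && return 0

    # Write the complete content to a temp file in the same directory...
    tmp="$(mktemp "$cfg_dir/user-settings.XXXXXX")" || return 1
    printf '/LIBS/GUID = "%s"\n' "$(new_uuid)" > "$tmp"

    # ...then hard-link it into place. ln refuses to replace an existing
    # file, so if a concurrent process won the race, its file is kept.
    ln "$tmp" "$cfg" 2>/dev/null
    rm -f "$tmp"

    # Succeed as long as a settings file now exists, whoever wrote it.
    [ -e "$cfg" ]
}
```

Calling ensure_sra_guid with no argument targets $HOME/.ncbi; passing a directory (for example /root/.ncbi) targets that location instead. Because the file is linked into place only after it is fully written, a reader never sees a partially written file.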