populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Remove/modify Query_command? #647

Open · MattWellie opened this issue 7 months ago

MattWellie commented 7 months ago

We use query_command a lot in prod-pipes, and the design of this method is unpleasant: it is a really backwards process that enables some barely-predictable runtime behaviours (e.g. pip-installing packages at runtime).

I will advocate for replacing all this with the following:

Each job that would currently run via query_command would instead run a specific script path inside a specific image, so failures would be easier to troubleshoot.
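
A minimal sketch of what that could look like with plain hailtop.batch (the backend settings, job name, image tag, and in-image script path below are all placeholders, and the script is assumed to have been baked into the image at build time):

```python
import hailtop.batch as hb

# Placeholder backend settings; in practice these come from hailctl / CPG config.
backend = hb.ServiceBackend(
    billing_project='my-billing-project',
    remote_tmpdir='gs://my-tmp-bucket/tmp',
)
b = hb.Batch(name='run-baked-script', backend=backend)

j = b.new_job('annotate-variants')  # hypothetical job name
j.image('example-registry/images/my_tool:1.2.3')  # placeholder image that contains the script
j.command('python3 /scripts/annotate_variants.py --input gs://my-bucket/input.mt')  # placeholder paths

b.run()
```

A failure can then usually be reproduced by pulling the same image and running the same command.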

jmarshall commented 7 months ago

I largely agree with that, except that replacing code that is currently contained in your production-pipelines branch/commit with code from a specific image — so that to iterate and develop it, you would have to build a new image — would be a massive step backwards.

What we need is a simple way to transport scripts from your production-pipelines branch/commit to the worker node, which is something that @cassimons and I have discussed in the past. That was stalled but has started to progress again.
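
For illustration only, a rough sketch of the manual version of that idea with plain hailtop.batch: read the script from the current checkout at pipeline-construction time and recreate it inside the container. The helper name and paths are made up, and this is not the API under discussion.

```python
import shlex
from pathlib import Path

import hailtop.batch as hb


def job_running_local_script(b: hb.Batch, image: str, local_script: str):
    """Hypothetical helper: embed a script from this checkout into the job's command."""
    script_text = Path(local_script).read_text()
    name = Path(local_script).name

    j = b.new_job(f'run {name}')
    j.image(image)
    # Recreate the script inside the container, then execute it there.
    # (Very large scripts could hit shell argument-length limits.)
    j.command(f'printf %s {shlex.quote(script_text)} > /tmp/{name}')
    j.command(f'python3 /tmp/{name}')
    return j
```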

MattWellie commented 7 months ago

except that replacing code that is currently contained in your production-pipelines branch/commit with code from a specific image — so that to iterate and develop it, you would have to build a new image — would be a massive step backwards.

I don't agree - the current problems are evidence that taking the code from one image and the libraries from another is already causing trouble, and on top of that we have the grotty Python-to-bash-to-Python transcription. Testing new query_command-executed code at the moment requires overriding the target image in config, so it's at worst a step sideways, but not backwards.

The pipeline was designed to be iterated on by iterating on the images - the old CI workflow built a new image with each commit, with the intention that each code change meant an image change, which was then used to test the change. That decision still haunts us, and we could do with a strategy to fix it.

Copying a script/file from the current container into GCP Batch temp storage, then copying that temp file into the batch jobs to execute, is a workaround for creating the script file, but it doesn't address dependencies. If we want a test run to prove that an updated library produces equivalent or better results, it's too easy to think we're testing the latest libraries but accidentally execute with the canonical cpg_workflows image, which was the problem we had yesterday.

MattWellie commented 7 months ago

I would just add here that this isn't a problem unique to Hail Batch: for every other containerised pipeline in the world the solution is 'in this container, run this script'. I think we're over-complicating the problem by doing anything else.

cassimons commented 6 months ago

I think our current dev experience for standalone scripts (Python or R) within pipelines is really unpleasant, and we need to reduce the friction and improve the debugging experience. As John mentioned above, I had been thinking a potential solution would be better Hail-level support for moving the scripts into the execution container - but your comments @MattWellie ring very true.

If we were to do what you suggest, what would we need to do on the image build side of things to make it easy and intuitive? When someone commits a change to a script and then runs the pipeline we need to be able to assume that the new code will always be executed. Sorry if this is implicit in what you have written above and I just missed it.

MattWellie commented 6 months ago

The option I've been playing with is the high-friction version: the script-running jobs need the script to be baked into the image, so to iterate on the script you need to build a new image from the latest prod-pipes commit. That would be required if the new process needs new dependencies, but in most situations it'll be overkill.

The easiest option IMO is to replace/change the query_command wrapper with a git clone of the current repo and commit, as we do with all driver jobs. That way we're pulling in the exact code we want to execute, and we can either hard-code the path or create a scripts section in the default pipeline config. No mess, no fuss.
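
Roughly, a sketch of that approach (the commit value, image, and script path are placeholders; the config-driven variant would read the script path from a hypothetical scripts section instead):

```python
import hailtop.batch as hb

REPO_URL = 'https://github.com/populationgenomics/production-pipelines.git'
COMMIT_SHA = 'abc1234'               # placeholder: the commit the pipeline was launched from
SCRIPT_PATH = 'scripts/my_task.py'   # placeholder: could come from a scripts section in config

b = hb.Batch(name='git-clone-script-job', backend=hb.ServiceBackend())  # assumes billing project already configured
j = b.new_job('run-pinned-script')
j.image('example-registry/images/pipeline-base:latest')  # placeholder: needs git and python available
j.command(
    f'git clone {REPO_URL} repo '
    f'&& git -C repo checkout {COMMIT_SHA} '
    f'&& python3 repo/{SCRIPT_PATH}'
)
b.run()
```

The trade-off versus transporting a single file is an extra clone per job, but the code and commit being executed are unambiguous.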

jmarshall commented 6 months ago

Cas and I workshopped a convenient form of transporting script files a while back, and I have a draft of the API addition that was on the back burner waiting for upstream Hail support. But it doesn't really need to wait for that upstream convenience, so I should turn it into a PR so we can try it.

Personally I think “make this file right here available” is even less fuss and mess than doing yet another git clone from another worker. And more generally useful.

illusional commented 6 months ago

IMO, the "git clone" is just a sketchy version of this.

MattWellie commented 6 months ago

I think this should be discussed properly in a forum where we're all present; async comments are not useful. Just dropping some thoughts here on ways we can make the dev experience around this smoother.

Points against git cloning, hail uploading, or bash transcription:

Keen to discuss this in person, xox

vivbak commented 6 months ago

We are going to reinstate the production pipelines meeting soon! Once we do, we can chat about this. Alternatively, if that's not soon enough, data office hours :)