populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License

Remove/modify Query_command? #647

Open · MattWellie opened this issue 7 months ago

MattWellie commented 7 months ago

We use query_command a lot in prod-pipes, and the design of this method is unpleasant: it is a really backwards process that enables some barely-predictable runtime behaviours (e.g. pip-installing packages at runtime).

I will advocate for replacing all this with the following:

Each job that would currently run via query_command would instead run a specific script path inside a specific image, so failures would be easier to troubleshoot.
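
A minimal sketch of what that could look like with plain hailtop.batch (the backend settings, job name, image tag, and in-image script path below are all placeholders, and the script is assumed to have been baked into the image at build time):

```python
import hailtop.batch as hb

# Placeholder backend settings; in practice these come from hailctl / CPG config.
backend = hb.ServiceBackend(
    billing_project='my-billing-project',
    remote_tmpdir='gs://my-tmp-bucket/tmp',
)
b = hb.Batch(name='run-baked-script', backend=backend)

j = b.new_job('annotate-variants')  # hypothetical job name
j.image('example-registry/images/my_tool:1.2.3')  # placeholder image that contains the script
j.command('python3 /scripts/annotate_variants.py --input gs://my-bucket/input.mt')  # placeholder paths

b.run()
```

A failure can then usually be reproduced by pulling the same image and running the same command.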

jmarshall commented 7 months ago

I largely agree with that, except that replacing code that is currently contained in your production-pipelines branch/commit with code from a specific image — so that to iterate and develop it, you would have to build a new image — would be a massive step backwards.

What we need is a simple way to transport scripts from your production-pipelines branch/commit to the worker node, which is something that @cassimons and I have discussed in the past. That was stalled but has started to progress again.
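
For illustration only, a rough sketch of the manual version of that idea with plain hailtop.batch: read the script from the current checkout at pipeline-construction time and recreate it inside the container. The helper name and paths are made up, and this is not the API under discussion.

```python
import shlex
from pathlib import Path

import hailtop.batch as hb


def job_running_local_script(b: hb.Batch, image: str, local_script: str):
    """Hypothetical helper: embed a script from this checkout into the job's command."""
    script_text = Path(local_script).read_text()
    name = Path(local_script).name

    j = b.new_job(f'run {name}')
    j.image(image)
    # Recreate the script inside the container, then execute it there.
    # (Very large scripts could hit shell argument-length limits.)
    j.command(f'printf %s {shlex.quote(script_text)} > /tmp/{name}')
    j.command(f'python3 /tmp/{name}')
    return j
```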

MattWellie commented 7 months ago

except that replacing code that is currently contained in your production-pipelines branch/commit with code from a specific image — so that to iterate and develop it, you would have to build a new image — would be a massive step backwards.

I don't agree - the current problems are evidence that taking the code from one image and the libraries from another is already causing trouble, and on top of that we have the grotty Python-to-bash-to-Python transcription. Testing new query_command-executed code at the moment requires overriding the target image in config, so it's at worst a step sideways, but not backwards.

The pipeline was designed to be iterated on by iterating on the images - the old CI workflow built a new image with each commit, with the intention that each code change meant an image change, which was then used to test the change. That decision still haunts us, and we could do with a strategy to fix it.

Copying a script/file from the current container into GCP Batch temp storage, then copying that temp file into the batch jobs to execute, is a workaround for creating the script file, but it doesn't address dependencies. If we want a test run to prove that an updated library produces equivalent or better results, it's too easy to think we're testing the latest libraries but accidentally execute with the canonical cpg_workflows image, which was the problem we had yesterday.

MattWellie commented 7 months ago

I would just add here that this isn't a problem unique to Hail Batch: for every other containerised pipeline in the world the solution is 'in this container, run this script'. I think we're over-complicating the problem by doing anything else.

cassimons commented 6 months ago

I think our current dev experience for standalone scripts (Python or R) within pipelines is really unpleasant, and we need to reduce the friction and improve the debugging experience. As John mentioned above, I had been thinking a potential solution would be better Hail-level support for moving the scripts into the execution container - but your comments @MattWellie ring very true.

If we were to do what you suggest, what would we need to do on the image build side of things to make it easy and intuitive? When someone commits a change to a script and then runs the pipeline we need to be able to assume that the new code will always be executed. Sorry if this is implicit in what you have written above and I just missed it.

MattWellie commented 6 months ago

The option I've been playing with is the high-friction version: the script-running jobs need the script to be baked into the image, so to iterate on the script you need to build a new image from the latest prod-pipes commit. That would be required if the new process needs new dependencies, but in most situations it'll be overkill.

The easiest option IMO is to replace/change the query_command wrapper with a git clone of the current repo and commit, as we do with all driver jobs. That way we're pulling in the exact code we want to execute, and we can either hard-code the path or create a scripts section in the default pipeline config. No mess, no fuss.
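
Roughly, a sketch of that approach (the commit value, image, and script path are placeholders; the config-driven variant would read the script path from a hypothetical scripts section instead):

```python
import hailtop.batch as hb

REPO_URL = 'https://github.com/populationgenomics/production-pipelines.git'
COMMIT_SHA = 'abc1234'               # placeholder: the commit the pipeline was launched from
SCRIPT_PATH = 'scripts/my_task.py'   # placeholder: could come from a scripts section in config

b = hb.Batch(name='git-clone-script-job', backend=hb.ServiceBackend())  # assumes billing project already configured
j = b.new_job('run-pinned-script')
j.image('example-registry/images/pipeline-base:latest')  # placeholder: needs git and python available
j.command(
    f'git clone {REPO_URL} repo '
    f'&& git -C repo checkout {COMMIT_SHA} '
    f'&& python3 repo/{SCRIPT_PATH}'
)
b.run()
```

The trade-off versus transporting a single file is an extra clone per job, but the code and commit being executed are unambiguous.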

jmarshall commented 6 months ago

Cas and I workshopped a convenient form of transporting script files a while back, and I have a draft of the API addition that was on the back burner waiting for upstream Hail support. But it doesn't really need to wait for that upstream convenience, so I should turn it into a PR so we can try it.

Personally I think “make this file right here available” is even less fuss and mess than doing yet another git clone from another worker. And more generally useful.

illusional commented 6 months ago

IMO, the "git clone" is just a sketchy version of this.

MattWellie commented 6 months ago

I think this should be discussed properly in a forum where we're all present; async comments are not useful. Just dropping some thoughts here on ways we can make the dev experience around this smoother.

Points against git cloning, hail uploading, or bash transcription:

Keen to discuss this in person, xox

vivbak commented 6 months ago

We are going to reinstate the production pipelines meeting soon! Once we do, we can chat about this. Alternatively, if that's not soon enough, data office hours :)