populationgenomics / production-pipelines

Genomics workflows for CPG using Hail Batch
MIT License
2 stars 0 forks source link

Refactoring main for local dev #771

Open milo-hyben opened 1 month ago

milo-hyben commented 1 month ago

This PR is an idea how to refactor main.py so we can run prod-pipelines locally without any changes to the code. This would make the development and testing of production-pipelines locally much easier. Atm developer has 2 options, comment out the prod code or create a new temp local copy of main.pay file. None of the options is ideal.

This core idea is to have 'PL_ENVIRONMENT' set to e.g. dev and then main picks development version of WORKFLOWS. If PL_ENVIRONMENT is not set it assumes production version and it picks production version of WORKFLOWS.

There are two new files:

  1. workflows_prd.py
  2. workglofs_dev.py
MattWellie commented 1 month ago

I feel weird about modifying the entrypoint of a pipeline which is designed to run in a container, in order that we can run it outside a container.

This does feel like a clean solution, but equally wouldn't building a docker container and using that for local testing be more appropriate, so that we know that "it works on my machine" == "it works in production". Most modern IDEs can run tests while using a python environment inside a container AFAIK (though I can't remember if that's a paid/premium feature, it's been a while since I used it).

Config would also have to be altered so that the pipeline runs in a local mode...

I'm also not sure exactly what changes are being tested in the situation this PR addresses, so perhaps that wouldn't be a good solution.

jmarshall commented 1 month ago

I don't especially like the proposed solution — as it seems to lead to duplication, for one thing — and I don't really see what it buys you: if you were working on one particular workflow, you're still going to have to edit _workflowsdev.py to contain (only?) that workflow. That's basically equivalent to commenting things out in main.py, so it doesn't seem like a big win.

But more importantly I still don't understand what the problem this is trying to solve is.

The example given in the meeting was ”it'd be nice not to have to have hail installed locally to do prod-pipes development”. But prod-pipes is very very bound into hail, so I'm not sure what workflows you could be working on locally that wouldn't need to be able to import hailtop.batch. (Moreover main.py imports cpg_workflows which imports cpg_utils.hail_batch which imports hail — so it seems to me that you can't do anything at all in prod-pipes without the hail modules installed…)