We have a fundamental issue with our usage of cpg_workflows in Cloud infrastructure:
we use a cpg_workflows docker image, which installs a version of the production-pipelines codebase
we're using a flat layout (cpg_workflows is at the root of the project folder)
the first operation inside the driver image is to run a git clone of production-pipelines && cd into that folder
this places the local folder cpg_workflows at the front of the Python PATH (i.e. when main.py imports stages from cpg_workflows.X it uses the git cloned version of the code, not the installed version. The packages and versions available in the environment at runtime are sourced from the installation in the image). This is an inherent conflict
when we create a batch using main.py, we're in a position where cpg_utils, or seqr-loading-pipelines are sourced from the installation inside the image, while cpg_workflows is sourced from the cloned folder.
when we use query_command we pickle a bit of source code, pass it into a new VM, and run it there. When this code is run, we import the packages inside that image. i.e. for every one of our query_modules scripts which is run using query_command, we potentially import different versions of the cpg_workflows library (here) compared with when main.py was run to create the batch jobs.
Yes, this issue is just #647 in disguise, but seeing as #783 has been opened to split our pipelines into separate smaller pipelines for versioning reasons, it seems to me like we should address the more fundamental issue of 'which code are we even importing' before we start splitting up our codebase for more advanced versioning purposes.
Something as basic as moving to a src layout could be a partial solution to this. We would never be in a position to accidentally import from cpg_workflows import X and pick up the cloned code instead of the installed code. That would create stability between the code running when creating a batch, and the code running inside a batch.
The issue with the flat layout we have is that you can accidentally run code you don't intend to. Normally running in a container with pre-installed versions would overcome this, but analysis-runner clones the codebase into the running container before doing anything else, so we're right back at this vulnerability.
This change would make the pipeline more cumbersome to run while undergoing development - to test new changes, you need to push a new image and use it as a driver image - but it would add zero burden to code in main, and stability there should be the most important factor when we are trying to nail down exact versions of our tools and code.
Inspired by the python best practices @dancoates mentioned earlier in the week
https://www.pyopensci.org/python-package-guide/package-structure-code/python-package-structure.html
We have a fundamental issue with our usage of cpg_workflows in Cloud infrastructure:
cpg_workflows
is at the root of the project folder)cpg_workflows
at the front of the Python PATH (i.e. whenmain.py
imports stagesfrom cpg_workflows.X
it uses the git cloned version of the code, not the installed version. The packages and versions available in the environment at runtime are sourced from the installation in the image). This is an inherent conflictcpg_utils
, orseqr-loading-pipelines
are sourced from the installation inside the image, while cpg_workflows is sourced from the cloned folder.query_command
we pickle a bit of source code, pass it into a new VM, and run it there. When this code is run, we import the packages inside that image. i.e. for every one of ourquery_modules
scripts which is run usingquery_command
, we potentially import different versions of the cpg_workflows library (here) compared with whenmain.py
was run to create the batch jobs.Something as basic as moving to a
src
layout could be a partial solution to this. We would never be in a position to accidentally importfrom cpg_workflows import X
and pick up the cloned code instead of the installed code. That would create stability between the code running when creating a batch, and the code running inside a batch.The issue with the flat layout we have is that you can accidentally run code you don't intend to. Normally running in a container with pre-installed versions would overcome this, but analysis-runner clones the codebase into the running container before doing anything else, so we're right back at this vulnerability.
This change would make the pipeline more cumbersome to run while undergoing development - to test new changes, you need to push a new image and use it as a driver image - but it would add zero burden to code in
main
, and stability there should be the most important factor when we are trying to nail down exact versions of our tools and code.