populationgenomics / automated-interpretation-pipeline

Rare Disease variant prioritisation MVP
MIT License
5 stars 4 forks source link

Revolutionise the Images! #219

Open MattWellie opened 1 year ago

MattWellie commented 1 year ago

Thrown a few ideas down here in a haphazard way... This doesn't start to address the broader CPG questions around how to structure multi-layered docker builds across the images portfolio


AIP is currently built on top of the CPG driver image (for ease), and runs all jobs in that container. There are several issues with this approach:

This is done for convenience, and could be finessed out so that each job uses the minimum container.


Benefits?

Issues?

That seems like a lot of complexity, and requires quite a few images to be built for a new deployment. The core issue here is that a lot of the 'heavy lifting' in AIP is done in Hail, which is in turn reliant on the GCP/Azure SDK, so that is a chunky image. It's not really possible to make an end-to-end AIP image that's lightweight and cloud-independent. There's probably some combination that caters to all needs better than the current implementation though.

lgruen commented 1 year ago

The Hail library itself doesn't need the GCP / Azure SDK, so that could maybe serve as a common thinner base image.

not a huge issue, but reads & writes are coded lazily

Do you have an example of how this is implemented currently? Is it using gcloud storage cp or similar explicitly?

MattWellie commented 1 year ago

Not using console commands explicitly, but a lot of reliance on CloudPath to read/write directly to cloud locations - It wouldn't be many LOC to change this behaviour, but I'd just have to track down all those usages in code and swap them over to expecting a local file.

e.g. here - a direct write to the path passed as a CLI argument, which in this case is a GCP path, but could just as easily be a {batch.output}

lgruen commented 1 year ago

If I understand correctly, cloudpathlib should still work just fine without the SDKs, as it's either using the cloud provider's client libraries or direct REST APIs in the implementation.

That Hail team just recently removed the SDK from their images (https://github.com/hail-is/hail/pull/12526), so maybe we should try to do the same at least for a smaller base image that e.g. the AIP could build on?

It would break scripts that use gsutil directly (like https://cs.github.com/populationgenomics/ancestry/blob/0a25b549c670f6fe06f407852b10c22ff71ad483/scripts/copy_1kg_hgdp_tob_wgs_pca.sh), so maybe we'd still want it as a convenience for the standard analysis-runner driver.

@illusional You might be interested in this as well.

MattWellie commented 1 year ago

Oh, spicy. So cloudpathlib operates exclusively from env CREDENTIALS, not the gcloud library. That's useful.