Open MattWellie opened 1 year ago
The Hail library itself doesn't need the GCP / Azure SDK, so that could maybe serve as a common thinner base image.
not a huge issue, but reads & writes are coded lazily
Do you have an example of how this is implemented currently? Is it using gcloud storage cp
or similar explicitly?
Not using console commands explicitly, but a lot of reliance on CloudPath to read/write directly to cloud locations - It wouldn't be many LOC to change this behaviour, but I'd just have to track down all those usages in code and swap them over to expecting a local file.
e.g. here - a direct write to the path passed as a CLI argument, which in this case is a GCP path, but could just as easily be a {batch.output}
If I understand correctly, cloudpathlib
should still work just fine without the SDKs, as it's either using the cloud provider's client libraries or direct REST APIs in the implementation.
That Hail team just recently removed the SDK from their images (https://github.com/hail-is/hail/pull/12526), so maybe we should try to do the same at least for a smaller base image that e.g. the AIP could build on?
It would break scripts that use gsutil
directly (like https://cs.github.com/populationgenomics/ancestry/blob/0a25b549c670f6fe06f407852b10c22ff71ad483/scripts/copy_1kg_hgdp_tob_wgs_pca.sh), so maybe we'd still want it as a convenience for the standard analysis-runner driver.
@illusional You might be interested in this as well.
Oh, spicy. So cloudpathlib operates exclusively from env CREDENTIALS, not the gcloud library. That's useful.
Thrown a few ideas down here in a haphazard way... This doesn't start to address the broader CPG questions around how to structure multi-layered docker builds across the images portfolio
AIP is currently built on top of the CPG driver image (for ease), and runs all jobs in that container. There are several issues with this approach:
This is done for convenience, and could be finessed out so that each job uses the minimum container.
Benefits?
Issues?
That seems like a lot of complexity, and requires quite a few images to be built for a new deployment. The core issue here is that a lot of the 'heavy lifting' in AIP is done in Hail, which is in turn reliant on the GCP/Azure SDK, so that is a chunky image. It's not really possible to make an end-to-end AIP image that's lightweight and cloud-independent. There's probably some combination that caters to all needs better than the current implementation though.