Alter Docker build file

MattWellie commented 7 months ago

This started as a deletion of the sample-metadata dependency, which is redundant now that the newer metamist dependency is included. Happy to fall back to that as a minimal change (unless there's a reason we are keeping both deps?).

The proposed changes split the crazy 3.6GB mono-layer into 3 components:

OS-level install of standard system tools
install the hail version (which is linked to a specific hash release; we expect this to change infrequently)
install python libraries (we should probably do a better job of version-pinning this, or building from a requirements file)

My belief is that this change:

splits the one mega layer into 3, which would vastly improve pulling of layers in parallel. If one of these layers ends up being 99% of that build time, so be it; we can try and optimise further.
orders the layers as never-changes, rarely-changes, and may-frequently-change, in build order, so that instead of pulling 3.6GB to get the latest version of this image, you'd only need to pull the variable ~MB layer (provided that you have a prior version of this image on the appropriate server/local machine).

AFAIK this is standard Docker theory, and the current design we have is non-optimal.

Update: layer sizes

This results in 3 layers:

458MB
2.5GB
616MB

So unsurprisingly most of the weight is in the Hail installation, but it's a little more spread out. I'm experimenting with moving PhantomJS into the relatively static layer 1 as well.
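
To make the pull-size argument concrete, here is a rough back-of-the-envelope sketch in Python. It is only a model, not a measurement: it assumes that layers built before the changed one are still cached on the puller's side, that the changed layer and everything after it must be pulled again, and that 2.5GB is roughly 2500MB. The layer names and the `pull_size_mb` helper are made up for illustration and do not appear in the Dockerfile.

```python
from typing import List, Tuple

# (layer name, size in MB). Names are hypothetical labels for the three
# layers; sizes are the ones reported above, with 2.5GB taken as ~2500MB.
Layer = Tuple[str, int]


def pull_size_mb(layers: List[Layer], changed: str) -> int:
    """Return the MB to pull after the layer named `changed` is rebuilt.

    Assumes layers earlier in the build order are already cached, and that
    the changed layer plus every later layer must be downloaded again.
    """
    names = [name for name, _ in layers]
    first_changed = names.index(changed)
    return sum(size for _, size in layers[first_changed:])


# Proposed order: most stable layer first, most volatile layer last.
proposed = [("system-tools", 458), ("hail", 2500), ("python-libs", 616)]

# Same layers in the opposite order, purely for comparison.
volatile_first = list(reversed(proposed))

print(pull_size_mb(proposed, "python-libs"))        # 616
print(pull_size_mb(volatile_first, "python-libs"))  # 3574
print(pull_size_mb(proposed, "hail"))               # 3116
```

Under these assumptions, a change to the Python libraries costs a 616MB pull with the proposed ordering, versus roughly the full image if the volatile layer came first. A Hail bump still costs about 3.1GB, which is acceptable only because we expect it to be infrequent.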

illusional commented 7 months ago

Thanks for the suggestion @MattWellie!

I've been thinking about this a bit over the week. I'm totally here for this PR, but our current build mechanism causes a full image rebuild every time it runs, which almost certainly changes the Docker layer hashes, meaning extra layers and a larger image size. I wonder if it's worth breaking these up into separately tagged images, so we can pull in the specific image we need and reduce the build time (which I'd really love).

On a similar note, it would be great to move this image to the images repo and, as part of that, have some way to chain images together, so that a rebuild of a base image could trigger a rebuild of the images chained to it. How that works with floating tags I'm not 100% sure yet.

MattWellie commented 7 months ago

FYI @illusional https://github.com/populationgenomics/images/issues/139 (I haven't assigned you, but you should be aware that this is a related issue)

MattWellie commented 7 months ago

Closing this in favour of the related issue about beefing up our image building more generally.

populationgenomics / analysis-runner

Alter Docker build file #682