microsoft / ga4gh-tes

C# implementation of the GA4GH TES API; provides distributed batch task execution on Microsoft Azure
MIT License

Cache Docker images in Azure Storage as block blobs #209

Open MattMcL4475 opened 1 year ago

MattMcL4475 commented 1 year ago

Problem: In a fully private deployment, or when running a large number of concurrent TES tasks, the Docker images for tasks need to be in a private Azure Container Registry (ACR). However, ACRs are slow, expensive, and can be throttled.

Solution:

1. When TES runs a task, it should download the task's Docker image, upload it to the default Azure Storage account, and store it as a block blob in a container named dockerimages, with a name like broadinstitute/gatk/hash.tar, where hash is the Docker image hash. If TES cannot download the image and it does not already exist in storage, TES should fail the task with an error message saying so. TES should also update the blob's lastused metadata property with a UTC timestamp in ISO 8601 format every time the image is used. The documentation should be updated to describe this behavior, so that users can use other tools to delete old images, and/or TES can implement deletion in the future.
2. When the node runs, it should download the required Docker image(s) from Azure Storage. This requires refactoring the bash script and can likely simplify or remove ContainerConfiguration.
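The lastused bookkeeping from step 1 could look roughly like the sketch below, written against the Azure CLI rather than TES's C# code. The helper names iso8601_utc_now and touch_lastused and their argument order are hypothetical; the dockerimages container and the lastused key come from the proposal above. It assumes the Azure CLI is already logged in with access to the storage account.

```bash
#!/bin/bash
# Sketch only: helper names are illustrative and not part of TES.

# Print the current UTC time in ISO 8601 format, e.g. 2024-01-31T12:34:56Z
iso8601_utc_now() {
    date -u +"%Y-%m-%dT%H:%M:%SZ"
}

# Set the blob's "lastused" metadata to the current UTC time.
# Usage: touch_lastused <storage-account> <blob-name>
# Note: the metadata update overwrites the blob's existing metadata,
# so any other keys would have to be passed again.
touch_lastused() {
    local account="$1" blob="$2"
    az storage blob metadata update \
        --account-name "$account" \
        --container-name dockerimages \
        --name "$blob" \
        --metadata "lastused=$(iso8601_utc_now)"
}
```

A cleanup tool could then read the same key back with az storage blob metadata show and compare it against a retention cutoff.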

We should also provide documentation on how users can delete the existing outbound allow rules and manually add new images to Azure Storage for a fully private deployment (zero outbound allow rules).

MattMcL4475 commented 1 year ago

Example script:

#!/bin/bash
# Exit on errors, unset variables, and failures inside pipelines
set -euo pipefail

# Docker image details
IMAGE="broadinstitute/gatk@sha256:044112d3d70603732d4a654ecaee33919cf9d45332d47268f5f1697b6ed558ed"

# Extract repository, image name, and hash from the image reference
REPO_NAME=$(echo "$IMAGE" | cut -d'/' -f1)
IMAGE_NAME=$(echo "$IMAGE" | cut -d'/' -f2 | cut -d'@' -f1)
HASH_TYPE_AND_VALUE=$(echo "$IMAGE" | grep -o 'sha[0-9]*:[a-f0-9]*' | sed 's/:/_/')

# Construct the canonical name
# For IMAGE="broadinstitute/gatk@sha256:044112d3d70603732d4a654ecaee33919cf9d45332d47268f5f1697b6ed558ed"
# CANONICAL_NAME will be "broadinstitute_gatk_sha256_044112d3d70603732d4a654ecaee33919cf9d45332d47268f5f1697b6ed558ed.tar.gz"
CANONICAL_NAME="${REPO_NAME}_${IMAGE_NAME}_${HASH_TYPE_AND_VALUE}.tar.gz"

# Pull the image
docker pull "$IMAGE"

# Save the image to a TAR and compress it
docker save "$IMAGE" | gzip > "$CANONICAL_NAME"

# Upload to Azure Storage as block blob (assuming you've already logged in to Azure CLI and set the right subscription)
AZURE_STORAGE_ACCOUNT_NAME="your_storage_account_name"
AZURE_STORAGE_CONTAINER_NAME="your_container_name"
az storage blob upload --account-name "$AZURE_STORAGE_ACCOUNT_NAME" --container-name "$AZURE_STORAGE_CONTAINER_NAME" --type block --name "$CANONICAL_NAME" --content-type application/gzip --file "$CANONICAL_NAME"

# Optionally: Remove the local TAR file after uploading
rm "$CANONICAL_NAME"

echo "Image uploaded to Azure Storage with name: $CANONICAL_NAME"
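The node-side counterpart (step 2 of the proposal) might be sketched as follows. The function names canonical_blob_name and load_cached_image are hypothetical, as is staging the archive under /tmp; the flat repo_name_hash.tar.gz naming is taken from the example script above, not from TES itself. Unlike the cut-based parsing in that script, this version replaces every slash, so references with more than two path components also map to a single blob name.

```bash
#!/bin/bash
# Sketch only: function names and the /tmp staging path are assumptions.

# Map "repo/name@sha256:<hex>" to "repo_name_sha256_<hex>.tar.gz".
canonical_blob_name() {
    local image="$1"
    local path="${image%%@*}"    # part before '@', e.g. broadinstitute/gatk
    local digest="${image#*@}"   # part after '@', e.g. sha256:0441...
    # A tag-only reference has no '@', so the expansion leaves it unchanged
    if [ "$digest" = "$image" ]; then
        echo "image reference has no digest: $image" >&2
        return 1
    fi
    printf '%s_%s.tar.gz\n' \
        "$(printf '%s' "$path" | tr '/' '_')" \
        "$(printf '%s' "$digest" | tr ':' '_')"
}

# Download the cached archive and load it into the local Docker daemon.
# Usage: load_cached_image <storage-account> <container> <image-ref>
load_cached_image() {
    local account="$1" container="$2" image="$3" blob
    blob="$(canonical_blob_name "$image")" || return 1
    az storage blob download \
        --account-name "$account" \
        --container-name "$container" \
        --name "$blob" \
        --file "/tmp/$blob" || return 1
    # docker load accepts gzip-compressed archives directly
    docker load --input "/tmp/$blob" && rm -f "/tmp/$blob"
}
```

For example, load_cached_image your_storage_account_name your_container_name "$IMAGE" would fetch and load the archive uploaded by the script above. An image saved by digest may load without a repository name, so the caller may need to capture the image ID that docker load prints.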