pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

RFC: Improve OCI Image Python Tooling #388

Open d4l3k opened 2 years ago

d4l3k commented 2 years ago

Description

Quite a few of the cloud services and cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with these easier.

Container based services:

TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches just overlay files from the local directory on top of a base image. Our current patching implementation relies on having a local docker daemon to build a patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493

Ideally we could build a patch layer and push it in pure Python without requiring a local docker daemon, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload images.
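To illustrate why the layer-building half is straightforward in pure Python: per the OCI image spec, a layer is just a (typically gzipped) tar archive, referenced by the sha256 digest of the compressed blob in the manifest and by the `diff_id` (digest of the uncompressed tar) in the image config. A minimal sketch using only the standard library (the function name `build_patch_layer` is hypothetical, not TorchX API):

```python
import gzip
import hashlib
import io
import os
import tarfile


def build_patch_layer(workspace_dir: str):
    """Build a gzipped tar layer from a local directory.

    Returns (blob_bytes, layer_digest, diff_id) as needed to append a
    layer to an OCI image: the manifest references the digest of the
    *compressed* blob, while the image config's rootfs.diff_ids lists
    the digest of the *uncompressed* tar.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for root, _dirs, files in os.walk(workspace_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the workspace root so they
                # overlay cleanly on top of the base image's filesystem.
                tar.add(path, arcname=os.path.relpath(path, workspace_dir))
    uncompressed = raw.getvalue()
    diff_id = "sha256:" + hashlib.sha256(uncompressed).hexdigest()
    # mtime=0 keeps the gzip header reproducible across rebuilds.
    blob = gzip.compress(uncompressed, mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return blob, digest, diff_id
```

The remaining work for a full patch would be downloading the base image's manifest and config, appending `digest`/`diff_id` to them, and uploading the new blobs, which is where the registry client comes in.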

It seems like OCI containers are a logical choice for packaging ML training jobs/apps, but the current Python tooling is fairly lacking as far as I can see. Making these images easier to work with will likely help with the cloud story.

Detailed Proposal

Create a library for Python to manipulate OCI images with the following subset of features:

Non-goals:

Alternatives

Additional context/links

There is an existing oci-python library, but it's fairly early stage. We may be able to build upon it to enable this.

I opened an issue there as well: https://github.com/vsoch/oci-python/issues/15

Migsi commented 1 year ago

I think I stumbled across this limitation just now. I was trying to get torchx running with a fresh k8s cluster using CRI-O instead of docker/containerd as the runtime, and it always fails when trying to pull the image (which I imagine is only the first of a few "problematic" steps).

```
~$ torchx run -s kubernetes dist.ddp --script compute_world_size/main.py -j 1x1
torchx 2023-01-23 14:47:40 INFO     loaded configs from /home/user/playground/torchx_examples/torchx/examples/apps/.torchxconfig
torchx 2023-01-23 14:47:40 INFO     Checking for changes in workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps`...
torchx 2023-01-23 14:47:40 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2023-01-23 14:47:40 INFO     Workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps` resolved to filesystem path `/home/user/playground/torchx_examples/torchx/examples/apps`
torchx 2023-01-23 14:47:40 WARNING  failed to pull image ghcr.io/pytorch/torchx:0.4.0, falling back to local: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
torchx 2023-01-23 14:47:40 INFO     Building workspace docker image (this may take a while)...
```

... [trace left out, can attach it if required]

Could you please confirm this is actually related to the issue you are describing? If it is, would installing the docker runtime alongside CRI-O, just to get the toolchain back up and running, be enough? Also, are there any other steps required to get such a setup working?

Best regards