pytorch / torchx

TorchX is a universal job launcher for PyTorch applications. TorchX is designed to have fast iteration time for training/research and support for E2E production ML pipelines when you're ready.
https://pytorch.org/torchx

RFC: Improve OCI Image Python Tooling #388

Open d4l3k opened 2 years ago

d4l3k commented 2 years ago

Description

Quite a few of the cloud services and cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with these easier.

Container based services:

TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches just overlay files from the local directory on top of a base image. Our current patching implementation relies on having a local docker daemon to build a patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493

Ideally we could build a patch layer and push it in pure Python without requiring a local docker daemon, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload images.
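To illustrate why the layer-building half is straightforward in pure Python: per the OCI image spec, a layer is just a (typically gzipped) tar archive, referenced by the sha256 digest of the compressed blob in the manifest and by the `diff_id` (digest of the uncompressed tar) in the image config. A minimal sketch using only the standard library (the function name `build_patch_layer` is hypothetical, not TorchX API):

```python
import gzip
import hashlib
import io
import os
import tarfile


def build_patch_layer(workspace_dir: str):
    """Build a gzipped tar layer from a local directory.

    Returns (blob_bytes, layer_digest, diff_id) as needed to append a
    layer to an OCI image: the manifest references the digest of the
    *compressed* blob, while the image config's rootfs.diff_ids lists
    the digest of the *uncompressed* tar.
    """
    raw = io.BytesIO()
    with tarfile.open(fileobj=raw, mode="w") as tar:
        for root, _dirs, files in os.walk(workspace_dir):
            for name in files:
                path = os.path.join(root, name)
                # Store paths relative to the workspace root so they
                # overlay cleanly on top of the base image's filesystem.
                tar.add(path, arcname=os.path.relpath(path, workspace_dir))
    uncompressed = raw.getvalue()
    diff_id = "sha256:" + hashlib.sha256(uncompressed).hexdigest()
    # mtime=0 keeps the gzip header reproducible across rebuilds.
    blob = gzip.compress(uncompressed, mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return blob, digest, diff_id
```

The remaining work for a full patch would be downloading the base image's manifest and config, appending `digest`/`diff_id` to them, and uploading the new blobs, which is where the registry client comes in.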

It seems like OCI containers are a logical choice for packaging ML training jobs/apps, but the current Python tooling is fairly lacking as far as I can see. Making these images easier to work with will likely help with the cloud story.

Detailed Proposal

Create a library for Python to manipulate OCI images with the following subset of features:

Non-goals:

Alternatives

Additional context/links

There is an existing oci-python library, but it's fairly early stage. We may be able to build upon it to enable this.

I opened an issue there as well: https://github.com/vsoch/oci-python/issues/15

Migsi commented 1 year ago

I think I stumbled across this limitation just now. I was trying to get torchx running with a fresh k8s cluster using CRI-O instead of docker/containerd as the runtime, and it always fails when trying to pull the image (which I imagine is only the first of a few "problematic" steps).

```
~$ torchx run -s kubernetes dist.ddp --script compute_world_size/main.py -j 1x1
torchx 2023-01-23 14:47:40 INFO     loaded configs from /home/user/playground/torchx_examples/torchx/examples/apps/.torchxconfig
torchx 2023-01-23 14:47:40 INFO     Checking for changes in workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps`...
torchx 2023-01-23 14:47:40 INFO     To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2023-01-23 14:47:40 INFO     Workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps` resolved to filesystem path `/home/user/playground/torchx_examples/torchx/examples/apps`
torchx 2023-01-23 14:47:40 WARNING  failed to pull image ghcr.io/pytorch/torchx:0.4.0, falling back to local: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
torchx 2023-01-23 14:47:40 INFO     Building workspace docker image (this may take a while)...
```

... [trace left out, can attach it if required]

Could you please confirm this is actually related to the issue you are describing? If it is, would installing the docker runtime alongside CRI-O, just to get the toolchain back up and running, be enough? Also, are there any other steps required to get such a setup working?

Best regards