d4l3k opened 2 years ago
I think I stumbled across this limitation just now. I was trying to get torchx running on a fresh k8s cluster using CRI-O instead of docker/containerd as the runtime, and it always fails when trying to pull the image (which I imagine is only the first of a few "problematic" steps).
```
~$ torchx run -s kubernetes dist.ddp --script compute_world_size/main.py -j 1x1
torchx 2023-01-23 14:47:40 INFO loaded configs from /home/user/playground/torchx_examples/torchx/examples/apps/.torchxconfig
torchx 2023-01-23 14:47:40 INFO Checking for changes in workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps`...
torchx 2023-01-23 14:47:40 INFO To disable workspaces pass: --workspace="" from CLI or workspace=None programmatically.
torchx 2023-01-23 14:47:40 INFO Workspace `file:///home/user/playground/torchx_examples/torchx/examples/apps` resolved to filesystem path `/home/user/playground/torchx_examples/torchx/examples/apps`
torchx 2023-01-23 14:47:40 WARNING failed to pull image ghcr.io/pytorch/torchx:0.4.0, falling back to local: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))
torchx 2023-01-23 14:47:40 INFO Building workspace docker image (this may take a while)...
... [trace left out, can attach it if required]
```
Could you please confirm this is actually related to the issue you are describing? If it is, would it be enough to install the Docker runtime in parallel, just to get the toolchain back up and running? Are there any other steps required to get such a setup working?
Best regards
Description
Quite a few of the cloud services / cluster tools for running ML jobs use OCI/Docker containers, so I've been looking into how to make dealing with these easier.
Container-based services:
TorchX currently supports patches on top of existing images to make it fast to iterate and then launch a training job. These patches just overlay files from the local directory on top of a base image. Our current patching implementation relies on having a local docker daemon to build a patch layer and push it: https://github.com/pytorch/torchx/blob/main/torchx/schedulers/docker_scheduler.py#L437-L493
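To illustrate the daemon-based approach being replaced, here is a minimal sketch of the overlay idea: generate a trivial Dockerfile that copies the workspace on top of a base image, which would then be handed to the local Docker daemon to build. The helper name is hypothetical, not the actual TorchX implementation.

```python
# Hypothetical sketch of daemon-based patching (not the TorchX code):
# overlay the local workspace on a base image via a generated Dockerfile.

def make_patch_dockerfile(base_image: str, workspace_dir: str = ".") -> str:
    """Build a Dockerfile that copies the workspace on top of a base image."""
    return f"FROM {base_image}\nCOPY {workspace_dir} .\n"

# This Dockerfile would normally be passed to the local Docker daemon
# (e.g. via docker-py) -- exactly the dependency this issue wants to drop.
print(make_patch_dockerfile("ghcr.io/pytorch/torchx:0.4.0"))
```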
Ideally we could build a patch layer and push it in pure Python without requiring any local Docker instance, since that's an extra burden on ML researchers/users. Building a patch should be fairly straightforward since it's just appending a layer; pushing will require some ability to talk to the registry to download/upload images.
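Building the layer itself really is pure Python: an OCI layer is just a (gzipped) tar archive identified by its sha256 digest. A minimal sketch using only the stdlib (function name is hypothetical, not an existing library API):

```python
# Build an OCI-style patch layer in pure Python: tar the files, gzip the
# archive, and compute the content-addressed digest a registry expects.
import gzip
import hashlib
import io
import tarfile


def build_patch_layer(files: dict[str, bytes]) -> tuple[bytes, str]:
    """Pack in-memory files into a gzipped tar; return (blob, OCI digest)."""
    tar_buf = io.BytesIO()
    with tarfile.open(fileobj=tar_buf, mode="w") as tar:
        for path, data in files.items():
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # Fixed mtime keeps the gzip stream (and thus the digest) reproducible.
    blob = gzip.compress(tar_buf.getvalue(), mtime=0)
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    return blob, digest


blob, digest = build_patch_layer({"main.py": b"print('hello')\n"})
print(digest, len(blob))
```

The remaining work — uploading the blob and the updated manifest — is plain HTTP against the registry's distribution API, which is where a library would add the most value.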
It seems like OCI containers are a logical choice for packaging ML training jobs/apps, but the current Python tooling is fairly lacking as far as I can see. Making it easier to work with these images will likely help with the cloud story.
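For concreteness, "appending a patch layer" at the image level is mostly JSON manipulation: the OCI image manifest lists layer descriptors, so a patched image is the base manifest plus one extra descriptor. The media types below come from the OCI image spec; the helper itself is a hypothetical sketch.

```python
# Sketch: a patched OCI image is the base manifest plus one layer entry.
import json

OCI_LAYER = "application/vnd.oci.image.layer.v1.tar+gzip"


def append_layer(manifest: dict, digest: str, size: int) -> dict:
    """Return a copy of an OCI image manifest with one extra layer appended."""
    patched = json.loads(json.dumps(manifest))  # cheap deep copy
    patched["layers"].append(
        {"mediaType": OCI_LAYER, "digest": digest, "size": size}
    )
    return patched


base = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": "sha256:" + "0" * 64,  # placeholder digest
        "size": 2,
    },
    "layers": [],
}
patched = append_layer(base, "sha256:" + "a" * 64, 123)
print(json.dumps(patched, indent=2))
```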
Detailed Proposal
Create a library for Python to manipulate OCI images with the following subset of features:
Non-goals:
Alternatives
Additional context/links
There is an existing oci-python library, but it's fairly early. We may be able to build upon it to enable this.
I opened an issue there as well: https://github.com/vsoch/oci-python/issues/15