kubernetes-sigs / jobset

JobSet: a k8s native API for distributed ML training and HPC workloads
https://jobset.sigs.k8s.io/
Apache License 2.0
145 stars 47 forks

example containers are too chonky #231

Closed vsoch closed 11 months ago

vsoch commented 1 year ago

I'm testing out the pytorch example:

And it's taking almost 20 minutes to pull so far:

$ kubectl get pods --all-namespaces 
NAMESPACE            NAME                                         READY   STATUS              RESTARTS   AGE
default              pytorch-workers-0-0-zm7fk                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-1-wvl7p                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-2-894d5                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-3-q7m6n                    0/2     ContainerCreating   0          20m
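
A listing like this can also be checked programmatically. The following is a minimal Python sketch, not part of JobSet, that parses the `kubectl get pods --all-namespaces` table and flags pods that have sat in `ContainerCreating` for a long time. The function names `parse_age_minutes` and `slow_pulls` and the 10-minute threshold are assumptions made for illustration. The parser also assumes the RESTARTS column is a plain integer and that AGE uses simple forms such as `20m` or `1h5m`.

```python
import re
from typing import List

# Sample output copied from the report above.
SAMPLE = """\
NAMESPACE            NAME                                         READY   STATUS              RESTARTS   AGE
default              pytorch-workers-0-0-zm7fk                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-1-wvl7p                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-2-894d5                    0/2     ContainerCreating   0          20m
default              pytorch-workers-0-3-q7m6n                    0/2     ContainerCreating   0          20m
"""

# Minutes per unit for the suffixes kubectl uses in the AGE column.
_UNIT_MINUTES = {"d": 1440, "h": 60, "m": 1, "s": 1 / 60}


def parse_age_minutes(age: str) -> float:
    """Convert an AGE string such as '20m' or '1h5m' to minutes."""
    parts = re.findall(r"(\d+)([dhms])", age)
    return sum(int(number) * _UNIT_MINUTES[unit] for number, unit in parts)


def slow_pulls(table: str, threshold_minutes: float = 10) -> List[str]:
    """Return names of pods in ContainerCreating for at least the threshold."""
    stuck = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 6:
            continue  # not a pod row in the expected six-column layout
        _namespace, name, _ready, status, _restarts, age = fields[:6]
        if status == "ContainerCreating" and parse_age_minutes(age) >= threshold_minutes:
            stuck.append(name)
    return stuck


if __name__ == "__main__":
    for pod in slow_pulls(SAMPLE):
        print(pod)
```

In practice the table could be piped in from `kubectl get pods --all-namespaces` instead of using the hard-coded sample.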

And the pod events confirm it's pulling:

Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  20m   default-scheduler  Successfully assigned default/pytorch-workers-0-0-zm7fk to kind-control-plane
  Normal  Pulling    20m   kubelet            Pulling image "gcr.io/k8s-staging-jobset/pytorch-mnist:latest"

Unless my internet is wonky today... oh no, he's chonky:

image

I suspect large data is stored in there (or the model) and I'm wondering if this is really a suggested practice, or a best example for JobSet. Possibly we could brainstorm ideas, if not to reduce the size of this actual container, to provide a dummy example that won't take more than a few minutes to pull. Someone testing out JobSet should have a quick way to do that.
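
One possible shape for such a dummy example is sketched below: a small Python script that emits a JobSet manifest whose pods only print a line and exit, so the pull is dominated by a tiny base image. This is an illustration, not an example from the JobSet repo. The helper name `dummy_jobset`, the JobSet name `quick-test`, the `busybox:1.36` image, and the `jobset.x-k8s.io/v1alpha2` API version are assumptions that should be checked against the installed CRD.

```python
import json


def dummy_jobset(name: str = "quick-test", workers: int = 4,
                 image: str = "busybox:1.36") -> dict:
    """Build a JobSet manifest with one replicated Job of trivial pods."""
    container = {
        "name": "hello",
        "image": image,  # assumed small image; swap in any tiny image
        "command": ["sh", "-c", "echo hello from JobSet"],
    }
    # A standard batch/v1 Job template: `workers` pods run in parallel.
    job_template = {
        "spec": {
            "parallelism": workers,
            "completions": workers,
            "backoffLimit": 0,
            "template": {
                "spec": {"restartPolicy": "Never", "containers": [container]}
            },
        }
    }
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",  # assumption: verify locally
        "kind": "JobSet",
        "metadata": {"name": name},
        "spec": {
            "replicatedJobs": [
                {"name": "workers", "replicas": 1, "template": job_template}
            ]
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(dummy_jobset(), indent=2))
```

If saved as a hypothetical `dummy_jobset.py`, the output could be applied with `python dummy_jobset.py | kubectl apply -f -`.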

Updated title: the resnet container is even larger!

kannon92 commented 1 year ago

Maybe @tenzen-y has some ideas on smaller examples for PyTorchJobs?

danielvegamyhre commented 1 year ago

Yeah, ML container images in general are very large (multi-GB). The examples in the examples/simple folder do not use ML container images, though, so for those the container image pull time is minimal (a few seconds at most). However, they don't use real ML frameworks or do any distributed model training; they simply demonstrate JobSet features such as failure policies, etc.

If there are any small (<1GB) ML container images then I would be happy to put together an example, but in my experience so far they are all fairly large (for example, in the official PyTorch Docker image repo the smallest tag I see is ~3GB: https://hub.docker.com/r/pytorch/pytorch/tags).
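
To put those sizes in perspective, here is a rough back-of-the-envelope sketch in Python that estimates download time as size divided by bandwidth. The bandwidth values and the 5 MB figure for a tiny image are assumptions chosen for illustration, and the estimate ignores decompression, registry throttling, and layer caching. The function name `pull_minutes` is hypothetical.

```python
def pull_minutes(image_gb: float, bandwidth_mbit_s: float) -> float:
    """Estimate minutes to download image_gb (decimal GB) at a given bandwidth."""
    megabits = image_gb * 8 * 1000  # 1 GB = 8000 megabits
    seconds = megabits / bandwidth_mbit_s
    return seconds / 60


if __name__ == "__main__":
    # Sizes: an assumed tiny image, the 1 GB threshold, and the ~3 GB tag above.
    for size_gb in (0.005, 1.0, 3.0):
        for bandwidth in (20, 100, 1000):  # assumed link speeds in Mbit/s
            minutes = pull_minutes(size_gb, bandwidth)
            print(f"{size_gb:>6} GB at {bandwidth:>4} Mbit/s: {minutes:.1f} min")
```

Under these assumptions, a 3 GB image on a 20 Mbit/s link takes about 20 minutes, while a few-MB image finishes in seconds on any of the listed links.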

vsoch commented 1 year ago

Good idea @danielvegamyhre - I have a lot on my Q this weekend (actually, involving JobSet, can't wait to share at some point!) but I'll test out these simple examples soon.

tenzen-y commented 1 year ago

> Maybe @tenzen-y has some ideas on smaller examples for PyTorchJobs?

@kannon92 @danielvegamyhre @vsoch Sorry for the late response. I just came back from the Kubeflow code freeze. Actually, I maintain a minimal CPU-only PyTorch image (1.78GiB) in the kubeflow/katib repo. I hope the image is helpful for JobSet users.

https://hub.docker.com/layers/kubeflowkatib/pytorch-mnist-cpu/latest/images/sha256-6a28c7358934f00c58dd835a813cdbe19b8878a467a8fe46681647770ff8af2e?context=explore

kannon92 commented 11 months ago

/close

Not sure there is anything actionable on this side.

k8s-ci-robot commented 11 months ago

@kannon92: Closing this issue.
