Closed vsoch closed 11 months ago
Maybe @tenzen-y has some ideas on smaller examples for PyTorchJobs?
Yeah, ML container images in general are very large (multi-GB). The examples in the examples/simple folder do not use ML container images, so their image pull time is minimal (a few seconds at most). However, they don't use real ML frameworks or do any distributed model training; they simply demonstrate JobSet features such as failure policies.
If there are any small (<1GB) ML container images, I would be happy to put together an example, but in my experience so far they are all fairly large. For example, in the official PyTorch Docker image repo the smallest tag I see is ~3GB: https://hub.docker.com/r/pytorch/pytorch/tags.
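For anyone who wants to check image sizes before pulling, here is a rough sketch of filtering tag metadata by size. It assumes each record has a `name` and a `full_size` (bytes) field, which I believe matches what the Docker Hub tag listing returns, but treat that as an assumption. The helper name `tags_under` and the sample records are hypothetical and are not real sizes for any image.

```python
GIB = 1024 ** 3  # bytes in one GiB


def tags_under(tags, limit_bytes=GIB):
    """Return (name, size_in_GiB) pairs for tags smaller than limit_bytes, smallest first.

    `tags` is a list of dicts with `name` and `full_size` (bytes) keys.
    That record shape is an assumption, not a confirmed API contract.
    """
    small = [t for t in tags if t["full_size"] < limit_bytes]
    small.sort(key=lambda t: t["full_size"])
    return [(t["name"], round(t["full_size"] / GIB, 2)) for t in small]


# Hypothetical sample records for illustration only (not real Docker Hub data).
sample = [
    {"name": "tag-a", "full_size": 3 * GIB},
    {"name": "tag-b", "full_size": GIB // 2},
    {"name": "tag-c", "full_size": GIB // 4},
]

print(tags_under(sample))
```

Note that a registry typically reports compressed sizes, so the on-disk size after a pull can be larger.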
Good idea @danielvegamyhre - I have a lot in my queue this weekend (actually involving JobSet, can't wait to share at some point!), but I'll test out these simple examples soon.
Maybe @tenzen-y has some ideas on smaller examples for PyTorchJobs?
@kannon92 @danielvegamyhre @vsoch Sorry for the late response. I just came back from the Kubeflow code freeze. Actually, I maintain a minimal CPU-only PyTorch image (1.78GiB) in the kubeflow/katib repo. I hope the image is helpful for JobSet users.
/close
Not sure there is anything actionable on this side.
@kannon92: Closing this issue.
I'm testing out the pytorch example:
And it's taking almost 20 minutes to pull so far:
And verify it's pulling:
Unless my internet is wonky today... oh no, he's chonky:
I suspect large data (or the model) is stored in there, and I'm wondering whether this is really a suggested practice or the best example for JobSet. Possibly we could brainstorm ideas: if not to reduce the size of this actual container, then to provide a dummy example that won't take more than a few minutes to pull. Someone testing out JobSet should have a quick way to do that.
Updated title: the resnet container is even larger!