kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars 700 forks

Problem with "pytorch-dist-mnist-test:v1.0" image in example notebook "create-pytorchjob.ipynb" #2266

Open saileshd1402 opened 2 months ago

saileshd1402 commented 2 months ago

What happened?

When I run the examples/pytorch/image-classification/create-pytorchjob.ipynb notebook, the "pytorch-dist-mnist-test:v1.0" image tries to download the MNIST training dataset from https://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz, but that URL is currently not working: the request fails with HTTP 403 (Forbidden).

Error:

 Defaulted container "pytorch" out of: pytorch, init-pytorch (init)
Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden

The same dataset appears to be hosted at https://ossci-datasets.s3.amazonaws.com/mnist/train-images-idx3-ubyte.gz, which could be used as a replacement.

ref: https://github.com/pytorch/vision/blob/6d7851bd5e2bedc294e40e90532f0e375fcfee04/torchvision/datasets/mnist.py#L39
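
As a rough illustration of the workaround, here is a minimal sketch of a download helper that tries the S3 mirror first and falls back to the original host. This is not the code in the image (/var/mnist.py downloads through torchvision); the names `fetch_with_fallback`, `MNIST_MIRRORS`, and the `opener` parameter are hypothetical, and it assumes both hosts serve the same files under the same filenames.

```python
import os
import urllib.error
import urllib.request

# Base URLs taken from this issue, tried in order. That both hosts serve
# identical files under the same names is an assumption.
MNIST_MIRRORS = [
    "https://ossci-datasets.s3.amazonaws.com/mnist/",
    "http://yann.lecun.com/exdb/mnist/",
]


def fetch_with_fallback(filename, dest_dir, mirrors=MNIST_MIRRORS,
                        opener=urllib.request.urlopen):
    """Download `filename` from the first mirror that responds.

    Returns the local path. `opener` is injectable (hypothetical parameter)
    so the helper can be exercised without network access.
    """
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename)
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            with opener(url) as response:
                data = response.read()
        except urllib.error.URLError as exc:  # HTTPError (e.g. 403) is a subclass
            errors.append(f"{url}: {exc}")
            continue
        # Write only after a complete, successful read.
        with open(dest, "wb") as fh:
            fh.write(data)
        return dest
    raise RuntimeError("all mirrors failed:\n" + "\n".join(errors))
```

Calling `fetch_with_fallback("train-images-idx3-ubyte.gz", "./data")` would then skip a mirror that answers 403 and move on to the next one instead of crashing the job.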

What did you expect to happen?

Ideally, the "pytorch-dist-mnist-test:v1.0" image should be updated, or a replacement image should be provided.

Environment

Kubernetes version:

$ kubectl version

Client Version: v1.30.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.5

Training Operator version:

$ kubectl get pods -n kubeflow -l control-plane=kubeflow-training-operator -o jsonpath="{.items[*].spec.containers[*].image}"

kubeflow/training-operator:latest

Training Operator Python SDK version:

$ pip show kubeflow-training

Name: kubeflow-training
Version: 1.8.1
Summary: Training Operator Python SDK
Home-page: https://github.com/kubeflow/training-operator/tree/master/sdk/python
Author: Kubeflow Authors
Author-email: hejinchi@cn.ibm.com
License: Apache License Version 2.0
Location: /home/ubuntu/.kflowenv/lib/python3.11/site-packages
Requires: certifi, kubernetes, retrying, setuptools, six, urllib3
Required-by: 

Impacted by this bug?

Give it a 👍 We prioritize the issues with most 👍

YosiElias commented 2 weeks ago

/assign