kubeflow / pytorch-operator

PyTorch on Kubernetes
Apache License 2.0

Mnist dataset server is down #325

Open Jeffwan opened 3 years ago

Jeffwan commented 3 years ago

The e2e test is failing. The reason is straightforward: the dataset server returns a 503 error. I did some checking and noticed this is already being tracked in the torch community.

The patch is only available on master, and there is no way to specify the download path. I can either disable that single test case and wait for a stable release, or build a nightly image, which takes extra effort.

Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Traceback (most recent call last):
  File "/var/mnist.py", line 150, in <module>
    main()
  File "/var/mnist.py", line 123, in main
    transforms.Normalize((0.1307,), (0.3081,))
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 46, in __init__
    epoch, batch_idx * len(data), len(train_loader.dataset),
  File "/opt/conda/lib/python3.6/site-packages/torchvision-0.2.1-py3.6.egg/torchvision/datasets/mnist.py", line 114, in download
    if should_distribute():
  File "/opt/conda/lib/python3.6/urllib/request.py", line 223, in urlopen
    return opener.open(url, data, timeout)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 532, in open
    response = meth(req, response)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 642, in http_response
    'http', request, response, code, msg, hdrs)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 570, in error
    return self._call_chain(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 504, in _call_chain
    result = func(*args)
  File "/opt/conda/lib/python3.6/urllib/request.py", line 650, in http_error_default
    raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 503: Service Unavailable

Confirmed this is a server side issue.

https://discuss.pytorch.org/t/mnist-server-down/114433
https://github.com/pytorch/vision/issues/3554
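
For anyone hitting the same 503 in their own scripts, a mirror fallback can be sketched with the standard library alone. This is only a rough sketch and not what torchvision does internally: `fetch_with_fallback` is a hypothetical helper, the second mirror URL is a placeholder, and only the first URL comes from the log above.

```python
import urllib.error
import urllib.request

# The first entry is the URL from the failing log; the second is a
# hypothetical placeholder for whatever mirror you actually trust.
MIRRORS = (
    "http://yann.lecun.com/exdb/mnist/",
    "https://mirror.example.com/mnist/",
)


def fetch_with_fallback(filename, mirrors=MIRRORS,
                        opener=urllib.request.urlopen, timeout=30):
    """Try each mirror in order and return the bytes of the first success."""
    errors = []
    for base in mirrors:
        url = base + filename
        try:
            with opener(url, timeout=timeout) as response:
                return response.read()
        except urllib.error.URLError as exc:
            # HTTPError (e.g. 503) is a subclass of URLError, so both
            # HTTP-level and connection-level failures land here.
            errors.append("{}: {}".format(url, exc))
    raise RuntimeError("all mirrors failed: " + "; ".join(errors))
```

The `opener` parameter exists so the function can be exercised without network access; in normal use it can be left at its default.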

andreyvelich commented 3 years ago

@Jeffwan We faced the same problem in Katib. We are currently using FashionMNIST instead of MNIST: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L137. I believe it is hosted on the PyTorch servers.

yanniszark commented 3 years ago

@andreyvelich this sounds like a good solution. Another way would be to pre-download the dataset in the image. The problem is how to make a new image for the example. The current one is from the GCP registry, which is no longer available.
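
If the dataset is baked into the image, the training script also needs to know it can skip the download. One possible check is sketched below; `needs_download` and the file names in `EXPECTED_FILES` are assumptions made for illustration, since the on-disk layout depends on the torchvision version in the image.

```python
import os

# Hypothetical layout: files the image build step is assumed to have
# written under the data directory. Adjust to what the torchvision
# version in your image actually produces.
EXPECTED_FILES = (
    os.path.join("raw", "train-images-idx3-ubyte"),
    os.path.join("raw", "train-labels-idx1-ubyte"),
)


def needs_download(root, expected_files=EXPECTED_FILES):
    """Return True if any expected dataset file is missing under root."""
    return not all(
        os.path.isfile(os.path.join(root, name)) for name in expected_files
    )

# Possible use in the training script (not executed here):
#   datasets.MNIST(root, train=True, download=needs_download(root), ...)
```

With a check like this, a correctly built image never contacts the dataset server at run time, while an image built without the data still falls back to downloading.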

Jeffwan commented 3 years ago

> @Jeffwan We faced the same problem in Katib. We are currently using FashionMNIST instead of MNIST: https://github.com/kubeflow/katib/blob/master/examples/v1beta1/pytorch-mnist/mnist.py#L137. I believe it is hosted on the PyTorch servers.

Sounds good. Let me double-check whether the code is compatible with the FashionMNIST dataset. If it is, and the data server is reliable, we can quickly switch to it.
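
Since FashionMNIST has the same shape as MNIST (28x28 grayscale images, 10 classes), the swap is mostly a matter of changing the dataset class. A minimal sketch follows, assuming the example builds its dataset roughly the way the traceback suggests; `dataset_name` and `make_train_dataset` are hypothetical helpers, not code from the example or from #327.

```python
def dataset_name(use_fashion):
    """Pick the torchvision dataset class name to use."""
    return "FashionMNIST" if use_fashion else "MNIST"


def make_train_dataset(root="./data", use_fashion=True):
    """Build the training dataset, defaulting to FashionMNIST."""
    # Imported lazily so dataset_name() stays usable without torchvision.
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.ToTensor(),
        # Constants copied from the traceback above. They were presumably
        # computed for MNIST, so they are only approximate for FashionMNIST.
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
    dataset_cls = getattr(datasets, dataset_name(use_fashion))
    return dataset_cls(root, train=True, download=True, transform=transform)
```

Keeping the choice behind a flag makes it easy to switch back to MNIST once its server is reachable again.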

Jeffwan commented 3 years ago

The code has been changed in https://github.com/kubeflow/pytorch-operator/pull/327. We need a better way to publish images; this can be done after the 1.3 release.

umka1332 commented 3 years ago

> The code has been changed in #327. We need a better way to publish images; this can be done after the 1.3 release.

Hi @Jeffwan, Kubeflow 1.3 has already been released. Is there any progress on this?