pytorch / elastic

PyTorch elastic training
BSD 3-Clause "New" or "Revised" License

Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1) #150

Closed assapin closed 3 years ago

assapin commented 3 years ago

🐛 Bug

When running the imagenet example from examples/imagenet, I get the following error:

[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
  File "main.py", line 594, in <module>
    main()
  File "main.py", line 183, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "main.py", line 455, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "main.py", line 588, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
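The error message itself suggests the likely fix: replace `.view(-1)` with `.reshape(-1)` on the failing line. Below is a minimal sketch of what a patched `accuracy` helper could look like. It assumes the function in `main.py` follows the usual top-k accuracy pattern; apart from the `correct_k = ...` line taken from the traceback, the body is an assumption, not a copy of the repository's code.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Return the top-k accuracy (in percent) for each k in topk."""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        # Indices of the maxk highest-scoring classes, shape (batch, maxk).
        _, pred = output.topk(maxk, dim=1, largest=True, sorted=True)
        # Transpose to (maxk, batch); the result is a non-contiguous view.
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            # reshape(-1) falls back to a copy when the memory layout does
            # not allow a view, while view(-1) can raise the RuntimeError
            # shown in the traceback.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
```

Calling `accuracy(output, target, topk=(1, 5))` with logits of shape `(batch, num_classes)` and integer class labels of shape `(batch,)` returns two one-element tensors holding the top-1 and top-5 percentages.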

Component (check all that apply):

To Reproduce

Build the image from the Dockerfile in the Environment section below and run the imagenet example.

Expected behavior

Training should run and accuracy should be reported correctly.

Environment

Dockerfile:

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime

RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2

RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]

assapin commented 3 years ago

I see you fixed it in master. Was going to do a pull request.... next time :-)