Imagenet example fails during accuracy calculation (torchelastic v0.2.2 on PyTorch 1.8.1)

🐛 Bug

When running the imagenet example from examples/imagenet, I get the following error:

[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
  File "main.py", line 594, in <module>
    main()
  File "main.py", line 183, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "main.py", line 455, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "main.py", line 588, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
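The error message points at a likely workaround: replace .view(-1) with .reshape(-1) on the line at main.py:588. The sketch below is a minimal illustration, not the example's actual code. Only the failing line is confirmed by the traceback; the rest of the accuracy function body is an assumption modeled on the usual top-k accuracy helper, and the tensors at the bottom are made-up test data.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Return top-k accuracy (in percent) for each k in topk.

    Assumed reconstruction of the helper; only the correct_k line is
    known from the traceback, and it is changed from view to reshape.
    """
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        # Indices of the maxk highest scores per sample: shape (batch, maxk).
        _, pred = output.topk(maxk, 1, True, True)
        # Transposing yields shape (maxk, batch) without copying, so the
        # result is not contiguous in memory.
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            # reshape copies the data when a plain view is not possible,
            # so it is safe whether or not correct[:k] is contiguous.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res


# Made-up data: 2 samples, 6 classes.
output = torch.tensor([
    [0.10, 0.90, 0.00, 0.00, 0.00, 0.00],
    [0.80, 0.10, 0.05, 0.03, 0.01, 0.00],
])
target = torch.tensor([1, 2])
acc1, acc5 = accuracy(output, target, topk=(1, 5))

# The underlying restriction, shown in isolation: flattening a transposed
# tensor with view raises, while reshape succeeds.
t = torch.arange(6).reshape(2, 3).t()
try:
    t.view(-1)
    view_failed = False
except RuntimeError:
    view_failed = True
flat = t.reshape(-1)
```

Calling .contiguous() before .view(-1) would presumably work as well; reshape is simply what the error message recommends.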

Component (check all that applies):

[ ] state api
[ ] train_step api
[ ] train_loop
[ ] rendezvous
[ ] checkpoint
[ ] rollback
[ ] metrics
[ ] petctl
[X] examples
[ ] docker
[ ] other

To Reproduce

See the Dockerfile under Environment.

Expected behavior

Training should run without errors, and accuracy should be reported correctly.

Environment

Dockerfile:

FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime

RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2

RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]
