When running the imagenet example from examples/imagenet,
I get the following error:
[INFO] 2021-05-30 13:09:18,531 api: [default] Starting worker group
=> set cuda device = 0
=> creating model: resnet18
=> no workers have checkpoints, starting from epoch 0
=> start_epoch: 0, best_acc1: 0
Traceback (most recent call last):
  File "main.py", line 594, in <module>
    main()
  File "main.py", line 183, in main
    train(train_loader, model, criterion, optimizer, epoch, device_id, print_freq)
  File "main.py", line 455, in train
    acc1, acc5 = accuracy(output, target, topk=(1, 5))
  File "main.py", line 588, in accuracy
    correct_k = correct[:k].view(-1).float().sum(0, keepdim=True)
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead.
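A possible fix, following the hint in the error message, is to replace `.view(-1)` with `.reshape(-1)` in `accuracy`. The sketch below assumes the function follows the standard PyTorch ImageNet example; only the `correct_k` line is taken from the traceback, and the rest of the body is a reconstruction for illustration, not a copy of main.py. The likely cause is that `correct` inherits a non-contiguous layout from the transposed `pred`, so a slice of several rows cannot be flattened without a copy.

```python
import torch


def accuracy(output, target, topk=(1,)):
    """Compute top-k accuracy (in percent) for each k in topk."""
    with torch.no_grad():
        maxk = max(topk)
        batch_size = target.size(0)

        # Indices of the maxk highest scores per sample: shape (batch, maxk).
        _, pred = output.topk(maxk, 1, True, True)
        # Transpose to (maxk, batch); this makes the tensor non-contiguous.
        pred = pred.t()
        correct = pred.eq(target.view(1, -1).expand_as(pred))

        res = []
        for k in topk:
            # reshape copies the data when a view is not possible,
            # whereas view raises the RuntimeError shown above.
            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)
            res.append(correct_k.mul_(100.0 / batch_size))
        return res
```

Calling `.contiguous()` before `.view(-1)` should work equally well; `.reshape(-1)` is simply the shorter change and is what the error message recommends.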
Component (check all that apply):
[ ] state api
[ ] train_step api
[ ] train_loop
[ ] rendezvous
[ ] checkpoint
[ ] rollback
[ ] metrics
[ ] petctl
[x] examples
[ ] docker
[ ] other
To Reproduce
See environment
Expected behavior
Training should run without errors, and accuracy should be reported correctly.
Environment
Dockerfile:
FROM pytorch/pytorch:1.8.1-cuda11.1-cudnn8-runtime
RUN apt-get -q update && apt-get -q install -y wget unzip
RUN pip install torchelastic==0.2.2
RUN mkdir ./train
COPY elastic/examples/imagenet/main.py ./train
WORKDIR ./train
RUN chmod -R a+w .
USER root
ENTRYPOINT ["python", "-m", "torchelastic.distributed.launch"]
CMD ["--help"]