Closed vanbasten23 closed 2 years ago
Can you get resnet50 running with fake data on the same TPU VM? I want to make sure this is not caused by a setup issue.
On the same red VM, running resnet50 with fake data succeeds:
(base) xiowei@xioweicloudtop1:~/pytorch/xla$ gcloud compute ssh xiowei-dlrm-tutorial --zone=us-central1-a
Last login: Mon Oct 24 17:20:00 2022 from 216.239.45.216
xiowei@xiowei-dlrm-tutorial:~$ conda activate torch-xla-1.13
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ export TPU_IP_ADDRESS=10.35.41.18
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ export XRT_TPU_CONFIG="tpu_worker;0;$TPU_IP_ADDRESS:8470"
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ git clone --recursive https://github.com/pytorch/xla.git
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$ python xla/test/test_train_mp_imagenet.py --fake_data --model=resnet50 --num_epochs=1
==> Preparing data..
Epoch 1 train begin 18:25:04
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
==> Preparing data..
| Training Device=xla:0/7 Epoch=1 Step=0 Loss=6.89059 Rate=4.06 GlobalRate=4.06 Time=18:25:39
| Training Device=xla:0/1 Epoch=1 Step=0 Loss=6.89059 Rate=4.09 GlobalRate=4.09 Time=18:25:39
| Training Device=xla:0/3 Epoch=1 Step=0 Loss=6.89059 Rate=3.99 GlobalRate=3.99 Time=18:25:39
| Training Device=xla:0/6 Epoch=1 Step=0 Loss=6.89059 Rate=3.93 GlobalRate=3.93 Time=18:25:39
| Training Device=xla:0/4 Epoch=1 Step=0 Loss=6.89059 Rate=3.97 GlobalRate=3.97 Time=18:25:39
| Training Device=xla:0/5 Epoch=1 Step=0 Loss=6.89059 Rate=4.11 GlobalRate=4.11 Time=18:25:39
...
| Training Device=xla:0/3 Epoch=1 Step=1160 Loss=0.00136 Rate=590.95 GlobalRate=452.56 Time=18:30:35
Epoch 1 train end 18:30:38
| Test Device=xla:0/3 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/7 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:1/0 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/2 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/4 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/6 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/1 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/5 Step=0 Epoch=1 Time=18:30:42
| Test Device=xla:0/4 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/1 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/3 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/2 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:1/0 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/5 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/7 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/6 Step=20 Epoch=1 Time=18:30:48
| Test Device=xla:0/1 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/3 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:1/0 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/6 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/2 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/5 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/7 Step=40 Epoch=1 Time=18:30:49
| Test Device=xla:0/4 Step=40 Epoch=1 Time=18:30:49
Epoch 1 test end 18:30:49, Accuracy=100.00
Max Accuracy: 100.00%
(torch-xla-1.13) xiowei@xiowei-dlrm-tutorial:~$
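For anyone repeating this sanity check, here is a minimal sketch of pulling the per-device numbers out of a log like the one above, so it is easy to see which devices reported and what rates they hit. The line format is inferred only from the output pasted here; the names `parse_training_lines` and `devices_seen` are hypothetical helpers, not part of torch_xla.

```python
import re

# Matches lines shaped like the training output above, e.g.
# | Training Device=xla:0/3 Epoch=1 Step=1160 Loss=0.00136 Rate=590.95 GlobalRate=452.56 Time=18:30:35
# Assumption: the field order is fixed, as in the pasted log.
TRAIN_LINE = re.compile(
    r"\|\s*Training\s+Device=(?P<device>\S+)\s+Epoch=(?P<epoch>\d+)\s+"
    r"Step=(?P<step>\d+)\s+Loss=(?P<loss>\S+)\s+"
    r"Rate=(?P<rate>\S+)\s+GlobalRate=(?P<global_rate>\S+)"
)


def parse_training_lines(log_text):
    """Return one dict per training line found in log_text; other lines are skipped."""
    records = []
    for line in log_text.splitlines():
        match = TRAIN_LINE.search(line)
        if match is None:
            continue
        records.append({
            "device": match.group("device"),
            "epoch": int(match.group("epoch")),
            "step": int(match.group("step")),
            "loss": float(match.group("loss")),
            "rate": float(match.group("rate")),
            "global_rate": float(match.group("global_rate")),
        })
    return records


def devices_seen(records):
    """Sorted list of distinct device names that produced a training line."""
    return sorted({record["device"] for record in records})


# Two lines copied from the log above, plus one line that should be ignored.
sample = """==> Preparing data..
| Training Device=xla:0/7 Epoch=1 Step=0 Loss=6.89059 Rate=4.06 GlobalRate=4.06 Time=18:25:39
| Training Device=xla:0/3 Epoch=1 Step=1160 Loss=0.00136 Rate=590.95 GlobalRate=452.56 Time=18:30:35
"""

records = parse_training_lines(sample)
for record in records:
    print(record["device"], record["step"], record["rate"])
print("devices:", devices_seen(records))
```

Saving the full run to a file and passing its contents to `parse_training_lines` gives the same records for every device, which can then be compared against a failing run.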
🐛 Bug
With the r1.13 release 2VM image, the DLRM test crashes. The error I got is:
To Reproduce
In project tpu-pytorch, search for the 2VM red image xiowei-2vm-red-image. Use it to create a red VM. Inside the VM, do
Expected behavior
It should succeed.
Environment
Additional context