pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

Training Resnet50 on ImageNet using TPU leads to OOM after a few epochs. #3470

Open jotron opened 2 years ago

jotron commented 2 years ago

🐛 Bug

I have a GCE TPU-VM setup with PyTorch 1.11. Running the provided test_train_mp_imagenet.py on the ImageNet dataset with 8 cores leads to an OOM error after a nondeterministic number of epochs (e.g. 7 epochs).

To Reproduce

I am using a TPU-VM on GCE with software version tpu-vm-pt-1.11. I have attached an SSD disk with the ImageNet data.

Steps to reproduce the behavior:

  1. Set export XRT_TPU_CONFIG='localservice;0;localhost:51011'
  2. Clone Repo git clone https://github.com/pytorch/xla.git
  3. Install TensorboardX pip install tensorboardX
  4. Run Command:
    python3 xla/tests/test_train_mp_imagenet.py --datadir=<path-to-data> --model=resnet50 --num_epochs=90 --num_cores=8 --num_workers=8 --batch_size=128 --log_steps=100 --logdir=logs/e90_b128
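
For context, the script spawns one training process per TPU core and feeds each one through a parallel device loader. Below is a minimal sketch of that pattern, under my own assumptions (placeholder dataset, model setup, and hyperparameters); it is a simplified illustration, not the actual test_train_mp_imagenet.py code.

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
from torch.utils.data import DataLoader, TensorDataset
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl
import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
  # One XLA device (TPU core) per spawned process.
  device = xm.xla_device()
  model = torchvision.models.resnet50().to(device)
  optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
  loss_fn = nn.CrossEntropyLoss()

  # Placeholder data; the real script reads ImageNet from --datadir.
  dataset = TensorDataset(
      torch.randn(256, 3, 224, 224), torch.randint(0, 1000, (256,)))
  loader = DataLoader(dataset, batch_size=128)

  for epoch in range(1, 3):
    # MpDeviceLoader moves batches to the TPU and calls xm.mark_step()
    # after every step, which is where the traceback further below surfaces.
    device_loader = pl.MpDeviceLoader(loader, device)
    model.train()
    for step, (data, target) in enumerate(device_loader):
      optimizer.zero_grad()
      loss = loss_fn(model(data), target)
      loss.backward()
      xm.optimizer_step(optimizer)  # gradient all-reduce + optimizer step


if __name__ == '__main__':
  xmp.spawn(_mp_fn, args=(), nprocs=8)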

Expected behavior

I expect the script to run for all epochs until completion, producing logs of the form:

...
| Training Device=xla:0/5 Epoch=5 Step=1200 Loss=2.87482 Rate=200.31 GlobalRate=192.10 Time=23:57:12
| Training Device=xla:0/4 Epoch=5 Step=1200 Loss=2.86071 Rate=200.31 GlobalRate=192.10 Time=23:57:12
| Training Device=xla:0/2 Epoch=5 Step=1200 Loss=2.53118 Rate=200.30 GlobalRate=192.10 Time=23:57:12
Epoch 5 train end 23:57:42
...

until the 90th epoch. The logs look like this up to epoch 7.

Error

...
| Training Device=xla:0/7 Epoch=7 Step=1200 Loss=2.47287 Rate=199.53 GlobalRate=190.44 Time=00:26:16
| Training Device=xla:0/2 Epoch=7 Step=1200 Loss=2.47380 Rate=199.53 GlobalRate=190.44 Time=00:26:16
| Training Device=xla:0/6 Epoch=7 Step=1200 Loss=2.68791 Rate=199.52 GlobalRate=190.44 Time=00:26:16
| Training Device=xla:0/4 Epoch=7 Step=1200 Loss=2.49217 Rate=199.53 GlobalRate=190.44 Time=00:26:16
Epoch 7 train end 00:26:46
| Test Device=xla:0/3 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/5 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/1 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/6 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/4 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/7 Step=0 Epoch=7 Time=00:27:01
| Test Device=xla:0/2 Step=0 Epoch=7 Time=00:27:02
| Test Device=xla:1/0 Step=0 Epoch=7 Time=00:27:07
Epoch 7 test end 00:27:21, Accuracy=44.20
Epoch 8 train begin 00:27:21
2022-04-04 00:27:49.308827: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
2022-04-04 00:27:49.642192: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] StackTrace:
2022-04-04 00:27:49.642244: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** Begin stack trace ***
2022-04-04 00:27:49.642250: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    tensorflow::CurrentStackTrace()
2022-04-04 00:27:49.642256: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::ReportComputationError(tensorflow::Status const&, absl::lts_20211102::Span<xla::XlaComputation const* const>, absl::lts_20211102::Span<xla::Shape const* const>)
2022-04-04 00:27:49.642266: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::ShapeHash(xla::Shape const&)
2022-04-04 00:27:49.642285: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::XrtComputationClient::ExecuteComputation(xla::ComputationClient::Computation const&, absl::lts_20211102::Span<std::shared_ptr<xla::ComputationClient::Data> const>, std::string const&, xla::ComputationClient::ExecuteComputationOptions const&)
2022-04-04 00:27:49.642291: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-04-04 00:27:49.642296: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    xla::util::MultiWait::Complete(std::function<void ()> const&)
2022-04-04 00:27:49.642302: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-04-04 00:27:49.642306: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-04-04 00:27:49.642311: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    
2022-04-04 00:27:49.642316: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]    clone
2022-04-04 00:27:49.642321: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] *** End stack trace ***
2022-04-04 00:27:49.642327: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-04-04 00:27:49.642332: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Status: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2022-04-04 00:27:49.642337: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 2 root error(s) found.
2022-04-04 00:27:49.642343: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (0) RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
2022-04-04 00:27:49.642348: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]     [[{{node XRTExecute}}]]
2022-04-04 00:27:49.642355: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2022-04-04 00:27:49.642360: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-04-04 00:27:49.642365: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]     [[XRTExecute_G10]]
2022-04-04 00:27:49.642370: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2022-04-04 00:27:49.642375: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-04-04 00:27:49.642380: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   (1) RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
2022-04-04 00:27:49.642385: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]     [[{{node XRTExecute}}]]
2022-04-04 00:27:49.642390: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
2022-04-04 00:27:49.642395: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 
2022-04-04 00:27:49.642399: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 successful operations.
2022-04-04 00:27:49.642404: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] 0 derived errors ignored.
2022-04-04 00:27:49.642409: E tensorflow/compiler/xla/xla_client/xla_util.cc:88] Recent warning and error logs:
2022-04-04 00:27:49.642414: E tensorflow/compiler/xla/xla_client/xla_util.cc:88]   OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
Exception in device=TPU:4: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
     [[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

     [[XRTExecute_G10]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

  (1) RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
     [[{{node XRTExecute}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.

0 successful operations.
0 derived errors ignored.
Recent warning and error logs:
  OP_REQUIRES failed at tpu_execute_op.cc:266 : RESOURCE_EXHAUSTED: Attempting to reserve 6.48G at the bottom of memory. That was not possible. There are 6.78G free, 0B reserved, and 3.33G reservable.
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 330, in _mp_start_fn
    _start_fn(index, pf_cfg, fn, args)
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/xla_multiprocessing.py", line 324, in _start_fn
    fn(gindex, *args)
  File "/home/joelandre/test_train_mp_imagenet.py", line 264, in _mp_fn
    accuracy = train_imagenet()
  File "/home/joelandre/test_train_mp_imagenet.py", line 240, in train_imagenet
    train_loop_fn(train_device_loader, epoch)
  File "/home/joelandre/test_train_mp_imagenet.py", line 207, in train_loop_fn
    for step, (data, target) in enumerate(loader):
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/parallel_loader.py", line 34, in __next__
    return self.next()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/distributed/parallel_loader.py", line 46, in next
    xm.mark_step()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 792, in mark_step
    devctx = _run_step_closures()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 778, in _run_step_closures
    closure()
  File "/usr/local/lib/python3.8/dist-packages/torch_xla/core/xla_model.py", line 760, in <lambda>
    step_closures.append(lambda a=args: closure(*a))
  File "/home/joelandre/test_train_mp_imagenet.py", line 107, in _train_update
    loss.item(),

RuntimeError: RESOURCE_EXHAUSTED: From /job:localservice/replica:0/task:0:
2 root error(s) found.
...
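
Note where the failure surfaces: the script registers a logging callback (_train_update in the trace) via xm.add_step_closure, and the closures run when xm.mark_step() is called by the parallel loader. The loss.item() call inside the closure is what pulls the loss value back from the TPU, so that is where the device-side RESOURCE_EXHAUSTED error is raised in Python. A rough sketch of that pattern (the closure body and signature are my approximation, not the exact script code):

import torch_xla.core.xla_model as xm


def _train_update(device, step, loss, epoch):
  # loss.item() transfers the lazily computed loss from the TPU to the host;
  # if the executed graph failed with RESOURCE_EXHAUSTED on device, the error
  # surfaces here, inside the closure run by xm.mark_step().
  print(f'Training Device={device} Epoch={epoch} Step={step} Loss={loss.item():.5f}')


# Registered inside the training loop for the steps that get logged:
# xm.add_step_closure(_train_update, args=(device, step, loss, epoch))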

The full error log is in logfull.txt. Additionally, I ran:

./xla/scripts/debug_run.py --outfile logs/debug_run.tar.gz -- python3 -u test_train_mp_mnist.py --num_epochs=1 --num_cores=8 --num_workers=8

The produced archive can be found on Google Drive.
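
One thing that may help narrow down whether something accumulates across epochs (a suggestion on my side, not something taken from the debug archive): print the torch_xla metrics report at the end of every epoch and watch whether counters such as CompileTime keep increasing past the first epochs. A minimal sketch:

import torch_xla.core.xla_model as xm
import torch_xla.debug.metrics as met


def log_epoch_metrics(epoch):
  # Print only from the master ordinal to avoid eight copies of the report.
  if xm.is_master_ordinal():
    print(f'--- torch_xla metrics after epoch {epoch} ---')
    print(met.metrics_report())

Calling log_epoch_metrics(epoch) right after each "train end" line would show whether recompilations keep happening late into training, which can be one source of growing device memory pressure.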

miaoshasha commented 2 years ago

Thanks @jotron! I was able to reproduce this error. Could you please use 1.10 for now? We'll fix this ASAP and let you know!

jotron commented 2 years ago

Thank you for taking care of it @miaoshasha! I wouldn't rule out the bug occurring in 1.10 as well, though. I can't rerun it right now, but I remember it also happening to me on 1.10, just after 50 epochs instead. Any progress?

JackCaoG commented 2 years ago

Debugging is in progress; I will report back when I have a better idea of what is going on.

miladm commented 2 years ago

Thanks for reporting this bug. I will work on it later in the week.