huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

TPU issue: possible memory leak in eval loop #8508

Closed: zcain117 closed this issue 3 years ago

zcain117 commented 3 years ago

I am running into an HBM OOM during the eval loop of XLNet (--model_name_or_path xlnet-large-cased) when running on TPUs. No matter which batch size I use, the behavior is the same:

  1. training loop succeeds
  2. eval loop starts, makes it about halfway, then the TPU runs out of HBM and the eval loop dies

All the other models that we test are OK. The xlnet-large-cased test last passed on 2020-09-14.

Since this is unrelated to batch size, I thought maybe there is a memory leak on the TPU. The eval loop seems like a more likely culprit than the training loop, since the OOM only happens during eval.

Here are the last few lines of output before oom:

E 2020-11-12T04:51:27.984001264Z Saving model checkpoint to MNLI
E 2020-11-12T04:51:27.989368910Z Configuration saved in MNLI/config.json
E 2020-11-12T04:51:40.438957029Z Model weights saved in MNLI/pytorch_model.bin
E 2020-11-12T04:51:40.535782031Z 11/12/2020 04:51:40 - INFO - run_glue -   *** Evaluate ***
E 2020-11-12T04:51:40.536480018Z The following columns in the evaluation set don't have a corresponding argument in `XLNetForSequenceClassification.forward` and have been ignored: idx, hypothesis, premise.
E 2020-11-12T04:51:40.540513400Z ***** Running Evaluation *****
E 2020-11-12T04:51:40.540566285Z   Num examples = 9815
E 2020-11-12T04:51:40.540575559Z   Batch size = 8
E 2020-11-12T05:11:26.995136217Z 
  0%|          | 0/154 [00:00<?, ?it/s]
  1%|1         | 2/154 [00:11<14:01,  5.53s/it]
  2%|1         | 3/154 [00:22<18:15,  7.25s/it]
...
 49%|####9     | 76/154 [14:34<15:49, 12.17s/it]
 50%|#####     | 77/154 [14:48<16:25, 12.80s/it]2020-11-12 05:11:26.994477: E     511 tensorflow/compiler/xla/xla_client/xla_util.cc:76] >>> Dumping Computation 0

I'm not sure what the issue could be. It seems like both the training loop and the eval loop are using ParallelLoader, which should call xm.mark_step for every call to next.
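
For reference, here is a minimal sketch of how I understand the per-device loop is wired up with torch_xla (hypothetical code, not the actual Trainer internals; `eval_on_device` is just an illustrative name):

    import torch
    import torch_xla.core.xla_model as xm
    import torch_xla.distributed.parallel_loader as pl

    def eval_on_device(model, eval_dataloader):
        device = xm.xla_device()
        # per_device_loader wraps the dataloader and calls xm.mark_step()
        # after each batch, which should flush the pending XLA graph and
        # let device memory from the previous step be released.
        loader = pl.ParallelLoader(eval_dataloader, [device]).per_device_loader(device)
        for batch in loader:
            with torch.no_grad():
                outputs = model(**batch)  # predictions would be accumulated here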

Does anyone else have any ideas what could be happening?

Environment info

Who can help

@sgugger @LysandreJik

Information

Model I am using (Bert, XLNet ...): xlnet-large-cased

The problem arises when using: the official example script (examples/text-classification/run_glue.py, launched via examples/xla_spawn.py)

The task I am working on is: GLUE MNLI (an official task)

To reproduce

Steps to reproduce the behavior:

  1. git clone https://github.com/huggingface/transformers.git
  2. cd transformers && pip install .
  3. pip install datasets
  4. Training command:
      python examples/xla_spawn.py \
        --num_cores 8 \
        examples/text-classification/run_glue.py \
        --logging_dir=./tensorboard-metrics \
        --task_name MNLI \
        --cache_dir ./cache_dir \
        --do_train \
        --do_eval \
        --num_train_epochs 3 \
        --max_seq_length 128 \
        --learning_rate 3e-5 \
        --output_dir MNLI \
        --overwrite_output_dir \
        --logging_steps 100 \
        --save_steps 3000 \
        --overwrite_cache \
        --tpu_metrics_debug \
        --model_name_or_path xlnet-large-cased \
        --per_device_train_batch_size 32 \
        --per_device_eval_batch_size 8

Expected behavior

Eval loop finishes without TPU OOM.

sgugger commented 3 years ago

The problem is that you are aggregating all your predictions on the TPU host, with a big evaluation set. You should use the eval_accumulation_steps argument to pass the predictions back to the CPU every 20 evaluation steps, for instance, to avoid the OOM.
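
With run_glue.py this is just an extra --eval_accumulation_steps 20 flag on the command line. As a rough sketch, here is the same setting when constructing TrainingArguments directly (output_dir and the eval batch size are copied from your repro; the value 20 is only an illustration):

    from transformers import TrainingArguments

    # Equivalent of adding --eval_accumulation_steps 20 to the run_glue.py
    # command: the accumulated predictions are moved from the device to the
    # CPU every 20 eval steps instead of piling up until the end of the loop.
    args = TrainingArguments(
        output_dir="MNLI",
        per_device_eval_batch_size=8,
        eval_accumulation_steps=20,
    )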

zcain117 commented 3 years ago

Thanks for the response! I started a version of the workload that uses that flag; I'll update here once it finishes the training loop.

zcain117 commented 3 years ago

With that flag, I don't get the same OOM error. Instead I see:

E 2020-11-13T05:36:28.200219317Z 11/13/2020 05:36:28 - INFO - run_glue -   *** Evaluate ***
E 2020-11-13T05:36:28.201262406Z [INFO|trainer.py:388] 2020-11-13 05:36:28,200 >> The following columns in the evaluation set don't have a corresponding argument in `XLNetForSequenceClassification.forward` and have been ignored: premise, hypothesis, idx.
E 2020-11-13T05:36:28.205409874Z [INFO|trainer.py:1387] 2020-11-13 05:36:28,204 >> ***** Running Evaluation *****
E 2020-11-13T05:36:28.205583892Z [INFO|trainer.py:1388] 2020-11-13 05:36:28,205 >>   Num examples = 9815
E 2020-11-13T05:36:28.205718259Z [INFO|trainer.py:1389] 2020-11-13 05:36:28,205 >>   Batch size = 32
E 2020-11-13T05:43:14.914374736Z 
  0%|          | 0/39 [00:00<?, ?it/s]
  5%|5         | 2/39 [00:10<03:14,  5.26s/it]
  8%|7         | 3/39 [00:21<04:09,  6.92s/it]
 10%|#         | 4/39 [00:31<04:41,  8.04s/it]
 13%|#2        | 5/39 [00:42<05:02,  8.89s/it]
 15%|#5        | 6/39 [00:53<05:13,  9.51s/it]
 18%|#7        | 7/39 [01:05<05:21, 10.04s/it]
 21%|##        | 8/39 [01:16<05:20, 10.34s/it]
 23%|##3       | 9/39 [01:27<05:19, 10.64s/it]
 26%|##5       | 10/39 [01:38<05:10, 10.71s/it]Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Sent message larger than max (1342183400 vs. 1073741824) (8)
E 2020-11-13T05:43:14.914454025Z Traceback (most recent call last):
E 2020-11-13T05:43:14.914462893Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
E 2020-11-13T05:43:14.914469141Z     _start_fn(index, pf_cfg, fn, args)
E 2020-11-13T05:43:14.914474634Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
E 2020-11-13T05:43:14.914481036Z     fn(gindex, *args)
E 2020-11-13T05:43:14.914486906Z   File "/transformers/examples/text-classification/run_glue.py", line 414, in _mp_fn
E 2020-11-13T05:43:14.914495083Z     main()
E 2020-11-13T05:43:14.914623679Z   File "/transformers/examples/text-classification/run_glue.py", line 370, in main
E 2020-11-13T05:43:14.914648860Z     eval_result = trainer.evaluate(eval_dataset=eval_dataset)
E 2020-11-13T05:43:14.914657065Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1313, in evaluate
E 2020-11-13T05:43:14.914667922Z     prediction_loss_only=True if self.compute_metrics is None else None,
E 2020-11-13T05:43:14.914675010Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1431, in prediction_loop
E 2020-11-13T05:43:14.914681724Z     preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds"))
E 2020-11-13T05:43:14.914712087Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1474, in _gather_and_numpify
E 2020-11-13T05:43:14.914718679Z     tensors = nested_xla_mesh_reduce(tensors, name)
E 2020-11-13T05:43:14.914724791Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914731470Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914737871Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-13T05:43:14.914744687Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914751282Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914761474Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914768115Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-13T05:43:14.914774306Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914780896Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 113, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914788363Z     return xm.mesh_reduce(name, tensors, torch.cat)
E 2020-11-13T05:43:14.914794375Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 909, in mesh_reduce
E 2020-11-13T05:43:14.914801139Z     xdata = rendezvous(tag, bio.getvalue())
E 2020-11-13T05:43:14.914806782Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
E 2020-11-13T05:43:14.914813625Z     return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
E 2020-11-13T05:43:14.914819959Z RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Sent message larger than max (1342183400 vs. 1073741824) (8)
E 2020-11-13T05:43:15.468075089Z 
 26%|##5       | 10/39 [02:11<06:20, 13.12s/it]

I'll try some things on my side. It looks like the accumulation was fine for "eval_losses" but then failed on "eval_preds". I will try a more frequent eval accumulation and a smaller batch size and see if that results in a smaller message being sent between the TPU and the CPU.

sgugger commented 3 years ago

It still looks like a problem of memory (from the Sent message larger than max in the stack trace). Maybe try a lower eval_accumulation_steps?

Maybe we should move those tensors to the CPU before doing the mesh reduce to save a bit of host memory (right now they are reduced on all hosts then moved).
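
Something along these lines (a hedged sketch of the idea only, not the actual trainer_pt_utils code; `mesh_reduce_on_cpu` is a made-up helper name):

    import torch
    import torch_xla.core.xla_model as xm

    def mesh_reduce_on_cpu(name, tensor):
        # Copy the tensor to host memory first, then let the rendezvous
        # concatenate the per-core CPU copies, so the gathered result
        # never has to live on the XLA device.
        return xm.mesh_reduce(name, tensor.cpu(), torch.cat)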

zcain117 commented 3 years ago

I have a version running now with half the accumulation size and half the eval batch size.

Memory saving on the device is probably always good, but in this case it seems to be complaining about the size of the transfer payload. If you don't reduce before moving, the size of the transfer would probably be even bigger.

zcain117 commented 3 years ago

I tried with --eval_accumulation_steps 5 instead of 10 and --per_device_eval_batch_size 16 instead of 32 and ran into:

Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Received message larger than max (335550440 vs. 4194304) (8)

The 335550440 number is much smaller than the 1342183400 in the previous error message. I will try --eval_accumulation_steps 1 just in case, but I'm wondering if this error means something other than what I was assuming.

zcain117 commented 3 years ago

eval_accumulation_steps 1 resulted in the same error:

E 2020-11-17T00:27:24.619933766Z     main()
E 2020-11-17T00:27:24.619937169Z   File "/transformers/examples/text-classification/run_glue.py", line 370, in main
E 2020-11-17T00:27:24.619940804Z     eval_result = trainer.evaluate(eval_dataset=eval_dataset)
E 2020-11-17T00:27:24.619944189Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1313, in evaluate
E 2020-11-17T00:27:24.619947752Z     prediction_loss_only=True if self.compute_metrics is None else None,
E 2020-11-17T00:27:24.619951181Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1431, in prediction_loop
E 2020-11-17T00:27:24.619954905Z     preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds"))
E 2020-11-17T00:27:24.619958638Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1474, in _gather_and_numpify
E 2020-11-17T00:27:24.619962458Z     tensors = nested_xla_mesh_reduce(tensors, name)
E 2020-11-17T00:27:24.619965855Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.619976695Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619980624Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-17T00:27:24.619984750Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619988344Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.619992533Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619996086Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-17T00:27:24.619999738Z     return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.620003216Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 113, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.620006752Z     return xm.mesh_reduce(name, tensors, torch.cat)
E 2020-11-17T00:27:24.620010015Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 909, in mesh_reduce
E 2020-11-17T00:27:24.620013568Z     xdata = rendezvous(tag, bio.getvalue())
E 2020-11-17T00:27:24.620016833Z   File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
E 2020-11-17T00:27:24.620020510Z     return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
E 2020-11-17T00:27:24.620024011Z RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Received message larger than max (67114984 vs. 4194304) (8)

sgugger commented 3 years ago

It may be linked to the issue of XLNet outputting its memories on top of the logits (there is a PR under review to fix that).

zcain117 commented 3 years ago

That sounds plausible, since this issue only affects XLNet and none of our other tests.

Is this the right PR: https://github.com/huggingface/transformers/pull/8567 ?

sgugger commented 3 years ago

Yes, this PR will fix that, but the current v4 release candidate should also have another fix on the Trainer side (which basically ignores some of the keys in the model outputs).
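
Roughly, the Trainer-side fix boils down to something like this (a hypothetical helper for illustration, not the actual v4 code):

    # Keep only the useful outputs (e.g. the logits) and drop large
    # auxiliary ones such as XLNet's `mems` before the predictions are
    # gathered across TPU cores.
    def filter_model_outputs(outputs: dict, ignore_keys=("mems",)) -> dict:
        return {k: v for k, v in outputs.items() if k not in ignore_keys}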

zcain117 commented 3 years ago

Looks like #8567 was merged, and our XLNet test is now passing again. Thank you!

sgugger commented 3 years ago

Glad to hear it's fixed your issue :-)