Closed: zcain117 closed this issue 3 years ago.
The problem is that you are aggregating all your predictions on the TPU host, with a big evaluation set. You should use the `eval_accumulation_steps` argument to pass the predictions back to the CPU every 20 evaluation steps, for instance, to avoid the OOM.
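For reference, a minimal sketch of how that argument might be set programmatically (the `run_glue.py` example exposes it as the `--eval_accumulation_steps` flag); the output path and values here are illustrative:

```python
# Hedged sketch: model/task setup omitted, only the arguments relevant to the
# discussion above are shown. eval_accumulation_steps=20 tells the Trainer to
# move the accumulated predictions from the TPU to the CPU every 20 evaluation
# steps instead of keeping the whole evaluation set on the device.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="/tmp/output",          # illustrative path
    per_device_eval_batch_size=32,
    eval_accumulation_steps=20,
)
```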
Thanks for the response! I started a version of the workload that uses that flag and I'll update here once it finishes the training loop
With that flag, I don't get the same OOM error. Instead I see:
E 2020-11-13T05:36:28.200219317Z 11/13/2020 05:36:28 - INFO - run_glue - *** Evaluate ***
E 2020-11-13T05:36:28.201262406Z [INFO|trainer.py:388] 2020-11-13 05:36:28,200 >> The following columns in the evaluation set don't have a corresponding argument in `XLNetForSequenceClassification.forward` and have been ignored: premise, hypothesis, idx.
E 2020-11-13T05:36:28.205409874Z [INFO|trainer.py:1387] 2020-11-13 05:36:28,204 >> ***** Running Evaluation *****
E 2020-11-13T05:36:28.205583892Z [INFO|trainer.py:1388] 2020-11-13 05:36:28,205 >> Num examples = 9815
E 2020-11-13T05:36:28.205718259Z [INFO|trainer.py:1389] 2020-11-13 05:36:28,205 >> Batch size = 32
E 2020-11-13T05:43:14.914374736Z
0%| | 0/39 [00:00<?, ?it/s]
5%|5 | 2/39 [00:10<03:14, 5.26s/it]
8%|7 | 3/39 [00:21<04:09, 6.92s/it]
10%|# | 4/39 [00:31<04:41, 8.04s/it]
13%|#2 | 5/39 [00:42<05:02, 8.89s/it]
15%|#5 | 6/39 [00:53<05:13, 9.51s/it]
18%|#7 | 7/39 [01:05<05:21, 10.04s/it]
21%|## | 8/39 [01:16<05:20, 10.34s/it]
23%|##3 | 9/39 [01:27<05:19, 10.64s/it]
26%|##5 | 10/39 [01:38<05:10, 10.71s/it]
Exception in device=TPU:0: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Sent message larger than max (1342183400 vs. 1073741824) (8)
E 2020-11-13T05:43:14.914454025Z Traceback (most recent call last):
E 2020-11-13T05:43:14.914462893Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 329, in _mp_start_fn
E 2020-11-13T05:43:14.914469141Z _start_fn(index, pf_cfg, fn, args)
E 2020-11-13T05:43:14.914474634Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 323, in _start_fn
E 2020-11-13T05:43:14.914481036Z fn(gindex, *args)
E 2020-11-13T05:43:14.914486906Z File "/transformers/examples/text-classification/run_glue.py", line 414, in _mp_fn
E 2020-11-13T05:43:14.914495083Z main()
E 2020-11-13T05:43:14.914623679Z File "/transformers/examples/text-classification/run_glue.py", line 370, in main
E 2020-11-13T05:43:14.914648860Z eval_result = trainer.evaluate(eval_dataset=eval_dataset)
E 2020-11-13T05:43:14.914657065Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1313, in evaluate
E 2020-11-13T05:43:14.914667922Z prediction_loss_only=True if self.compute_metrics is None else None,
E 2020-11-13T05:43:14.914675010Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1431, in prediction_loop
E 2020-11-13T05:43:14.914681724Z preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds"))
E 2020-11-13T05:43:14.914712087Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1474, in _gather_and_numpify
E 2020-11-13T05:43:14.914718679Z tensors = nested_xla_mesh_reduce(tensors, name)
E 2020-11-13T05:43:14.914724791Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914731470Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914737871Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-13T05:43:14.914744687Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914751282Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914761474Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914768115Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-13T05:43:14.914774306Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-13T05:43:14.914780896Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 113, in nested_xla_mesh_reduce
E 2020-11-13T05:43:14.914788363Z return xm.mesh_reduce(name, tensors, torch.cat)
E 2020-11-13T05:43:14.914794375Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 909, in mesh_reduce
E 2020-11-13T05:43:14.914801139Z xdata = rendezvous(tag, bio.getvalue())
E 2020-11-13T05:43:14.914806782Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
E 2020-11-13T05:43:14.914813625Z return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
E 2020-11-13T05:43:14.914819959Z RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Sent message larger than max (1342183400 vs. 1073741824) (8)
E 2020-11-13T05:43:15.468075089Z
26%|##5 | 10/39 [02:11<06:20, 13.12s/it]
I'll try some things on my side. It looks like the accumulation was fine for `eval_losses` but then failed on `eval_preds`. I will try a more frequent eval accumulation and a smaller batch size and see if that results in a smaller message being sent between the TPU and the CPU.
It still looks like a memory problem (from the `Sent message larger than max` in the stack trace). Maybe try a lower `eval_accumulation_steps`?
Maybe we should move those tensors to the CPU before doing the mesh reduce to save a bit of host memory (right now they are reduced on all hosts then moved).
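For concreteness, here is a hedged sketch of what "move those tensors to the CPU before doing the mesh reduce" could look like; the helper name is made up, and the call mirrors the `xm.mesh_reduce(name, tensors, torch.cat)` line in the stack trace above:

```python
# Sketch of the suggestion above (not the actual Trainer change): move the
# tensor to CPU before the mesh reduce so device HBM is freed earlier. Note
# that the serialized rendezvous payload stays the same size, which is the
# payload-size caveat raised in the reply below.
import torch
import torch_xla.core.xla_model as xm

def gather_preds_on_cpu(name, tensor):
    return xm.mesh_reduce(name, tensor.detach().cpu(), torch.cat)
```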
I have a version running now with half the accumulation size and half the eval batch size.
Memory saving on device is probably always good, but in this case it seems to be complaining about the size of the transfer payload. If you don't reduce before moving, the size of the transfer would probably be even bigger.
I tried with `--eval_accumulation_steps 5` instead of 10 and `--per_device_eval_batch_size 16` instead of 32 and ran into:
Exception in device=TPU:4: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Received message larger than max (335550440 vs. 4194304) (8)
The 335550440 number is much smaller than the 1342183400 in the previous error message. I will try `--eval_accumulation_steps 1` just in case, but I'm wondering if this error means something other than what I was assuming.
`--eval_accumulation_steps 1` resulted in the same error:
E 2020-11-17T00:27:24.619933766Z main()
E 2020-11-17T00:27:24.619937169Z File "/transformers/examples/text-classification/run_glue.py", line 370, in main
E 2020-11-17T00:27:24.619940804Z eval_result = trainer.evaluate(eval_dataset=eval_dataset)
E 2020-11-17T00:27:24.619944189Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1313, in evaluate
E 2020-11-17T00:27:24.619947752Z prediction_loss_only=True if self.compute_metrics is None else None,
E 2020-11-17T00:27:24.619951181Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1431, in prediction_loop
E 2020-11-17T00:27:24.619954905Z preds_gatherer.add_arrays(self._gather_and_numpify(preds_host, "eval_preds"))
E 2020-11-17T00:27:24.619958638Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer.py", line 1474, in _gather_and_numpify
E 2020-11-17T00:27:24.619962458Z tensors = nested_xla_mesh_reduce(tensors, name)
E 2020-11-17T00:27:24.619965855Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.619976695Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619980624Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-17T00:27:24.619984750Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619988344Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.619992533Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.619996086Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 112, in <genexpr>
E 2020-11-17T00:27:24.619999738Z return type(tensors)(nested_xla_mesh_reduce(t, f"{name}_{i}") for i, t in enumerate(tensors))
E 2020-11-17T00:27:24.620003216Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/transformers/trainer_pt_utils.py", line 113, in nested_xla_mesh_reduce
E 2020-11-17T00:27:24.620006752Z return xm.mesh_reduce(name, tensors, torch.cat)
E 2020-11-17T00:27:24.620010015Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 909, in mesh_reduce
E 2020-11-17T00:27:24.620013568Z xdata = rendezvous(tag, bio.getvalue())
E 2020-11-17T00:27:24.620016833Z File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch_xla/core/xla_model.py", line 861, in rendezvous
E 2020-11-17T00:27:24.620020510Z return torch_xla._XLAC._xla_rendezvous(get_ordinal(), tag, payload, replicas)
E 2020-11-17T00:27:24.620024011Z RuntimeError: tensorflow/compiler/xla/xla_client/mesh_service.cc:364 : Failed to meet rendezvous 'eval_preds_1_0': Received message larger than max (67114984 vs. 4194304) (8)
It may be linked to the issue of XLNet outputting its memories on top of the logits (there is a PR under review to fix that).
That sounds plausible, since this issue only affects our XLNet test and none of the others.
Is this the right PR: https://github.com/huggingface/transformers/pull/8567 ?
Yes, this PR will fix that, but the current v4 release candidate should have another fix on the `Trainer` side (which basically ignores some of the keys in the model outputs).
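As a rough illustration of that Trainer-side idea (a sketch only, not the actual v4 code; the key name is assumed for XLNet's memories):

```python
# Drop large auxiliary outputs such as XLNet's `mems` before gathering, so
# only the logits cross the TPU/CPU boundary during evaluation.
IGNORED_OUTPUT_KEYS = ("mems",)  # assumed key name, for illustration

def keep_prediction_outputs(outputs):
    # If the model returns a dict-like output, keep everything except the
    # ignored keys; otherwise pass the outputs through unchanged.
    if isinstance(outputs, dict):
        return tuple(v for k, v in outputs.items() if k not in IGNORED_OUTPUT_KEYS)
    return outputs
```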
Looks like #8567 was submitted and now our xlnet test started passing. Thank you!
Glad to hear it's fixed your issue :-)
I am running into an HBM OOM during the eval loop of XLNet (`--model_name_or_path xlnet-large-cased`) when running on TPUs. No matter which batch size I use, the behavior is the same. All the other models that we test are OK; the `xlnet-large-cased` test last passed on 2020-09-14.
Since this is unrelated to batch size, I thought maybe there is a memory leak on the TPU. I think the eval loop is a more likely culprit than the training loop, since the only OOM happens during eval.
Here are the last few lines of output before the OOM:
I'm not sure what the issue could be. It seems like both the training loop and the eval loop are using `ParallelLoader`, which should call `xm.mark_step()` for every call to `next()`. Does anyone else have any ideas about what could be happening?
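For reference, a minimal sketch (not the Trainer's actual loop) of the `ParallelLoader` evaluation pattern described above, with predictions moved to the host at each step; the model and dataloader are assumed to exist:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.parallel_loader as pl

def eval_loop(model, eval_dataloader):
    # Illustrative only: per_device_loader inserts xm.mark_step() between
    # batches, so device-side graphs should not accumulate across eval steps.
    device = xm.xla_device()
    model.eval()
    device_loader = pl.ParallelLoader(eval_dataloader, [device]).per_device_loader(device)
    all_logits = []
    with torch.no_grad():
        for batch in device_loader:
            outputs = model(**batch)
            logits = outputs[0]              # assumes the first output is the logits
            all_logits.append(logits.cpu())  # offload to host instead of holding HBM
    return torch.cat(all_logits)
```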
Environment info
`transformers` version: 3.5.0

Who can help
@sgugger @LysandreJik
Information
Model I am using (Bert, XLNet ...): XLNet (`xlnet-large-cased`)
The problem arises when using: the official example script `examples/text-classification/run_glue.py`
The task I am working on is: an official GLUE task
To reproduce
Steps to reproduce the behavior: run the `text-classification/run_glue.py` example on a TPU with `--model_name_or_path xlnet-large-cased` and let it reach the evaluation loop.
Expected behavior
Eval loop finishes without TPU OOM.