microsoft / DeBERTa


Evaluation hangs for distributed MLM task #104

Open · dannyel2511 opened this issue 2 years ago

dannyel2511 commented 2 years ago

Hi, I want to report an issue that I found while running mlm.sh for deberta-base.

Description

Steps to reproduce:

Additional information

My system setup is:

Eunhui-Kim commented 2 years ago

I am also hitting the same situation. When I run on a single node with many GPUs, it works. However, when I use more than 2 nodes, it gets stuck at the same position that @dannyel2511 indicated.

My system setup is:

stefan-it commented 1 year ago

Hi @dannyel2511 and @Eunhui-Kim ,

I've seen this kind of hang in the pre-training phase as well. I was using different NVIDIA PyTorch containers, e.g. the 21.04 release, but I wasn't able to get it running. Single-GPU is working. Did you use any container-based setup?

In my experiment, I used a single-node machine with 8 GPUs. After loading the training corpus, I could see 100% utilization on all GPUs, and ~1 GB of GPU RAM was occupied on each GPU.

Eunhui-Kim commented 1 year ago

I just created a conda environment as the DeBERTa GitHub repo recommends. After creating the conda environment, I installed torch via pip, and then I installed DeBERTa with the "pip install ." method (using setup.py) in the DeBERTa source directory. On a single node with multiple GPUs, it works.

Eunhui-Kim commented 1 year ago

It works~

environment: Python 3.6, CUDA 10.2, NCCL 2.10.3, torch 1.10.1+cu102

Usually, the WORLD_SIZE for DDP is n_gpu * n_nodes; however, the DeBERTa source code computes WORLD_SIZE = WORLD_SIZE * n_gpu internally. Thus the WORLD_SIZE you pass in should be the number of nodes, and MASTER_ADDR=localhost.
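A minimal sketch of the two conventions described above, assuming a hypothetical two-node setup (the variable names are illustrative, not taken from the DeBERTa launcher):

import os
import torch

n_nodes = 2                               # hypothetical number of machines
n_gpu = torch.cuda.device_count()         # GPUs per node

# Conventional torch.distributed setup: WORLD_SIZE counts one process per GPU.
conventional_world_size = n_nodes * n_gpu

# Per the comment above, DeBERTa multiplies the WORLD_SIZE it reads from the
# environment by n_gpu internally, so the exported value should be the number
# of nodes only, with MASTER_ADDR set as reported for the working run.
os.environ["WORLD_SIZE"] = str(n_nodes)
os.environ["MASTER_ADDR"] = "localhost"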

pvcastro commented 1 year ago

@BigBird01 I'm also having this issue with distributed mode. Any hints on how to fix/run this? This is making it impossible to run in distributed mode. @Eunhui-Kim I didn't understand your last comment, were you able to find a way to run? What about you, @stefan-it?

Eunhui-Kim commented 1 year ago

@pvcastro actually the error was in the source code. The problem was that the code calls broadcast after all_gather, even though all_gather already includes the broadcast. If you remove lines 16 and 17 in _utils.py (path: DeBERTa/DeBERTa/apps/_utils.py), it will work fine. If you also follow the guidance above for the WORLD_SIZE and MASTER_ADDR options, then it works. Enjoy DeBERTa~ :)
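For anyone hitting the same hang, here is a hedged sketch of the redundancy being described, assuming the collective in question is torch.distributed.all_gather; the function and tensor names are hypothetical, not the actual contents of _utils.py:

import torch
import torch.distributed as dist

def gather_from_all_ranks(local_tensor: torch.Tensor):
    # all_gather leaves every rank with a copy of every other rank's tensor,
    # so the data is already synchronized across the group after this call.
    gathered = [torch.zeros_like(local_tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_tensor)
    # A follow-up broadcast of the same data is redundant, and if ranks do not
    # enter it consistently the collective can deadlock, matching the hang here.
    # dist.broadcast(local_tensor, src=0)   # the kind of call the comment says to remove
    return gathered

The advice above amounts to keeping the all_gather and dropping the extra broadcast (reported to be lines 16 and 17 of DeBERTa/DeBERTa/apps/_utils.py).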

adriwitek commented 6 months ago

@Eunhui-Kim The fix of setting WORLD_SIZE to the number of nodes and deleting lines 16 and 17 in DeBERTa/DeBERTa/apps/_utils.py does not seem to work for me after several experiments.

Are you also using CUDA 10.1? Maybe the error I get is related to it.

The error:

12/14/2023 08:00:34|ERROR|RTD|01| Uncatched exception happened during execution.
Traceback (most recent call last):
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 475, in <module>
    main(args)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 316, in main
    train_model(args, model, device, train_data, eval_data, run_eval_fn, loss_fn=loss_fn, train_fn = train_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 108, in train_model
    train_fn(args, model, device, data_fn = data_fn, eval_fn = eval_fn, loss_fn = loss_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 268, in train_fn
    trainer.train()
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 147, in train
    self._train_step(batch, bs_scale)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 206, in _train_step
    output = self.loss_fn(self, self.model, sub)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 367, in g_loss_fn
    new_data, mlm_loss, gen_output = model.make_electra_data(data, rand=rand)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 98, in make_electra_data
    gen = self.generator_fw(**new_data)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 86, in generator_fw
    return self.generator(**kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
12/14/2023 08:00:34|ERROR|RTD|02| Uncatched exception happened during execution.
Traceback (most recent call last):
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 475, in <module>
    main(args)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 316, in main
    train_model(args, model, device, train_data, eval_data, run_eval_fn, loss_fn=loss_fn, train_fn = train_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 108, in train_model
    train_fn(args, model, device, data_fn = data_fn, eval_fn = eval_fn, loss_fn = loss_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 268, in train_fn
    trainer.train()
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 147, in train
    self._train_step(batch, bs_scale)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/models/masked_language_model.py", line 109, in forward
    encoder_output = self.deberta(input_ids, input_mask, type_ids, output_all_encoded_layers=True, position_ids = position_ids)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 117, in forward
    output_all_encoded_layers=output_all_encoded_layers, return_att = return_att)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 184, in forward
    attention_mask = self.get_attention_mask(attention_mask)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 206, in _train_step
    output = self.loss_fn(self, self.model, sub)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 367, in g_loss_fn
    new_data, mlm_loss, gen_output = model.make_electra_data(data, rand=rand)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 98, in make_electra_data
    gen = self.generator_fw(**new_data)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 86, in generator_fw
    return self.generator(**kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 165, in get_attention_mask
    attention_mask = extended_attention_mask*extended_attention_mask.squeeze(-2).unsqueeze(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/models/masked_language_model.py", line 109, in forward
    encoder_output = self.deberta(input_ids, input_mask, type_ids, output_all_encoded_layers=True, position_ids = position_ids)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 117, in forward
    output_all_encoded_layers=output_all_encoded_layers, return_att = return_att)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 184, in forward
    attention_mask = self.get_attention_mask(attention_mask)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 165, in get_attention_mask
    attention_mask = extended_attention_mask*extended_attention_mask.squeeze(-2).unsqueeze(-1)
...