microsoft / DeBERTa


Evaluation hangs for distributed MLM task #104

Open · dannyel2511 opened this issue 2 years ago

dannyel2511 commented 2 years ago

Hi, I want to report an issue that I found while running mlm.sh for deberta-base.

Description

Steps to reproduce:

Additional information

My system setup is:

Eunhui-Kim commented 2 years ago

I am also hitting the same situation. When I run on a single node with many GPUs, it works. However, when I use more than 2 nodes, it gets stuck at the same position that @dannyel2511 indicated.

My system setup is:

stefan-it commented 1 year ago

Hi @dannyel2511 and @Eunhui-Kim ,

I've seen this kind of hang in the pre-training phase as well. I was using different NVIDIA PyTorch containers, e.g. the 21.04 release, but I wasn't able to get it running. Single-GPU is working. Did you use any container-based setup?

In my experiment, I used a single-node machine with 8 GPUs. After loading the training corpus, I could see 100% utilization on all GPUs, and ~1 GB of GPU RAM was occupied on each GPU.

Eunhui-Kim commented 1 year ago

I just created a conda environment as the DeBERTa GitHub repo recommends. After creating the conda environment, I installed torch via pip, and then I installed DeBERTa with the "pip install ." method (using setup.py) in the DeBERTa source directory. On a single node with multiple GPUs, it works.

Eunhui-Kim commented 1 year ago

It works~

environment: Python 3.6, CUDA 10.2, NCCL 2.10.3, torch 1.10.1+cu102

Usually, the WORLD_SIZE for DDP is n_gpu * n_nodes; however, the DeBERTa source code computes WORLD_SIZE = WORLD_SIZE * n_gpu internally. Thus the WORLD_SIZE you pass in should be the number of nodes, and MASTER_ADDR=localhost.
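A minimal sketch of the two conventions described above, assuming a hypothetical two-node setup (the variable names are illustrative, not taken from the DeBERTa launcher):

import os
import torch

n_nodes = 2                               # hypothetical number of machines
n_gpu = torch.cuda.device_count()         # GPUs per node

# Conventional torch.distributed setup: WORLD_SIZE counts one process per GPU.
conventional_world_size = n_nodes * n_gpu

# Per the comment above, DeBERTa multiplies the WORLD_SIZE it reads from the
# environment by n_gpu internally, so the exported value should be the number
# of nodes only, with MASTER_ADDR set as reported for the working run.
os.environ["WORLD_SIZE"] = str(n_nodes)
os.environ["MASTER_ADDR"] = "localhost"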

pvcastro commented 1 year ago

@BigBird01 I'm also having this issue with distributed mode. Any hints on how to fix/run this? This is making it impossible to run in distributed mode. @Eunhui-Kim I didn't understand your last comment, were you able to find a way to run? What about you, @stefan-it?

Eunhui-Kim commented 1 year ago

@pvcastro actually the error was in the source code. The problem was that the code calls broadcast after all_gather, even though all_gather already includes the broadcast. If you remove lines 16 and 17 in _utils.py (path: DeBERTa/DeBERTa/apps/_utils.py), it will work fine. If you also follow the guidance above for the WORLD_SIZE and MASTER_ADDR options, then it works. Enjoy DeBERTa~ :)
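For anyone hitting the same hang, here is a hedged sketch of the redundancy being described, assuming the collective in question is torch.distributed.all_gather; the function and tensor names are hypothetical, not the actual contents of _utils.py:

import torch
import torch.distributed as dist

def gather_from_all_ranks(local_tensor: torch.Tensor):
    # all_gather leaves every rank with a copy of every other rank's tensor,
    # so the data is already synchronized across the group after this call.
    gathered = [torch.zeros_like(local_tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_tensor)
    # A follow-up broadcast of the same data is redundant, and if ranks do not
    # enter it consistently the collective can deadlock, matching the hang here.
    # dist.broadcast(local_tensor, src=0)   # the kind of call the comment says to remove
    return gathered

The advice above amounts to keeping the all_gather and dropping the extra broadcast (reported to be lines 16 and 17 of DeBERTa/DeBERTa/apps/_utils.py).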

adriwitek commented 6 months ago

@Eunhui-Kim The fix of setting WORLD_SIZE to the number of nodes and deleting lines 16 and 17 in DeBERTa/DeBERTa/apps/_utils.py does not seem to work for me after several experiments.

Are you also using CUDA 10.1? Maybe the error I get is related to it.

The error:

12/14/2023 08:00:34|ERROR|RTD|01| Uncatched exception happened during execution.
Traceback (most recent call last):
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 475, in <module>
    main(args)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 316, in main
    train_model(args, model, device, train_data, eval_data, run_eval_fn, loss_fn=loss_fn, train_fn = train_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 108, in train_model
    train_fn(args, model, device, data_fn = data_fn, eval_fn = eval_fn, loss_fn = loss_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 268, in train_fn
    trainer.train()
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 147, in train
    self._train_step(batch, bs_scale)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 206, in _train_step
    output = self.loss_fn(self, self.model, sub)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 367, in g_loss_fn
    new_data, mlm_loss, gen_output = model.make_electra_data(data, rand=rand)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 98, in make_electra_data
    gen = self.generator_fw(**new_data)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 86, in generator_fw
    return self.generator(**kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
12/14/2023 08:00:34|ERROR|RTD|02| Uncatched exception happened during execution.
Traceback (most recent call last):
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 475, in <module>
    main(args)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 316, in main
    train_model(args, model, device, train_data, eval_data, run_eval_fn, loss_fn=loss_fn, train_fn = train_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/run.py", line 108, in train_model
    train_fn(args, model, device, data_fn = data_fn, eval_fn = eval_fn, loss_fn = loss_fn)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 268, in train_fn
    trainer.train()
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 147, in train
    self._train_step(batch, bs_scale)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/models/masked_language_model.py", line 109, in forward
    encoder_output = self.deberta(input_ids, input_mask, type_ids, output_all_encoded_layers=True, position_ids = position_ids)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 117, in forward
    output_all_encoded_layers=output_all_encoded_layers, return_att = return_att)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 184, in forward
    attention_mask = self.get_attention_mask(attention_mask)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/training/trainer.py", line 206, in _train_step
    output = self.loss_fn(self, self.model, sub)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 367, in g_loss_fn
    new_data, mlm_loss, gen_output = model.make_electra_data(data, rand=rand)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 98, in make_electra_data
    gen = self.generator_fw(**new_data)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/tasks/rtd_task.py", line 86, in generator_fw
    return self.generator(**kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 165, in get_attention_mask
    attention_mask = extended_attention_mask*extended_attention_mask.squeeze(-2).unsqueeze(-1)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/apps/models/masked_language_model.py", line 109, in forward
    encoder_output = self.deberta(input_ids, input_mask, type_ids, output_all_encoded_layers=True, position_ids = position_ids)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/deberta.py", line 117, in forward
    output_all_encoded_layers=output_all_encoded_layers, return_att = return_att)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 184, in forward
    attention_mask = self.get_attention_mask(attention_mask)
  File "MY_PATH_WHERE_CODE_IS/DeBERTa/venv/lib/python3.7/site-packages/DeBERTa/deberta/bert.py", line 165, in get_attention_mask
    attention_mask = extended_attention_mask*extended_attention_mask.squeeze(-2).unsqueeze(-1)
...