microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Error in Training on systems with only one GPU #932

Open ChidanandKumarKS opened 1 year ago

ChidanandKumarKS commented 1 year ago

Describe the bug

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutXLM (microsoft/layoutxlm-base)

The problem arises when using the official example script examples/run_xfun_re.py for XFUN relation extraction on a machine with only one GPU: the run fails during evaluation (trainer.evaluate()) with a torch.distributed error (see logs below).

To Reproduce

Steps to reproduce the behavior:

  1. python examples/run_xfun_re.py --model_name_or_path microsoft/layoutxlm-base --output_dir /tmp/test-ner --do_train --do_eval --lang zh --max_steps 2500 --per_device_train_batch_size 2 --warmup_ratio 0.1 --fp16

Expected behavior

Training and evaluation complete on a single-GPU machine without requiring a distributed process group to be initialized.

Logs:

  File "examples/run_xfun_re.py", line 245, in <module>
    main()
  File "examples/run_xfun_re.py", line 230, in main
    metrics = trainer.evaluate()
  File "/home/chowkam/chowkamWkspc/unilm-master/layoutlmft/layoutlmft/trainers/xfun_trainer.py", line 178, in evaluate
    self.args.local_rank = torch.distributed.get_rank()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
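For context, torch.distributed.get_rank() can only be called after torch.distributed.init_process_group(), which a distributed launcher sets up but a plain single-GPU `python examples/run_xfun_re.py ...` run does not. A minimal sketch of the guard pattern for this situation (the -1 fallback follows the Transformers convention for non-distributed runs; the variable name is illustrative):

```python
import torch.distributed as dist

# In a plain single-GPU run (launched with `python`, not a distributed
# launcher), no default process group exists, so dist.get_rank() raises
# the RuntimeError shown in the logs above.
if dist.is_available() and dist.is_initialized():
    local_rank = dist.get_rank()  # real rank in a distributed run
else:
    local_rank = -1               # non-distributed run (Transformers convention)
```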

PritikaRamu commented 1 year ago

I'm facing the same issue: training works on 1 GPU, but the error occurs during evaluation.

hiijar commented 1 year ago

I got the same problem and fixed it by changing "self.args.local_rank = torch.distributed.get_rank()" to "self.args.local_rank = -1" (xfun_trainer.py, line 178).
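For reference, a guarded variant of that edit keeps multi-GPU runs working as well; this is only a sketch of what the assignment around line 178 of xfun_trainer.py could look like, not an official patch:

```python
# layoutlmft/trainers/xfun_trainer.py, inside evaluate() -- sketch, not the actual repo code
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    self.args.local_rank = dist.get_rank()  # distributed evaluation
else:
    self.args.local_rank = -1               # single-GPU / non-distributed evaluation
```

Hard-coding -1 as in the comment above also works for single-GPU setups, but it would break distributed evaluation, so the guarded form may be safer if you switch between setups.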