microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

Error in Training on systems with only one GPU #932

Open ChidanandKumarKS opened 1 year ago

ChidanandKumarKS commented 1 year ago

Describe the bug

Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutXLM (microsoft/layoutxlm-base)

The problem arises when using the official example script examples/run_xfun_re.py for XFUN relation extraction on a machine with only one GPU: the run fails during evaluation (trainer.evaluate()) with a torch.distributed error (see logs below).

To Reproduce

Steps to reproduce the behavior:

  1. python examples/run_xfun_re.py --model_name_or_path microsoft/layoutxlm-base --output_dir /tmp/test-ner --do_train --do_eval --lang zh --max_steps 2500 --per_device_train_batch_size 2 --warmup_ratio 0.1 --fp16

Expected behavior

Training and evaluation complete on a single-GPU machine without requiring a distributed process group to be initialized.

Logs:

  File "examples/run_xfun_re.py", line 245, in <module>
    main()
  File "examples/run_xfun_re.py", line 230, in main
    metrics = trainer.evaluate()
  File "/home/chowkam/chowkamWkspc/unilm-master/layoutlmft/layoutlmft/trainers/xfun_trainer.py", line 178, in evaluate
    self.args.local_rank = torch.distributed.get_rank()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
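For context, torch.distributed.get_rank() can only be called after torch.distributed.init_process_group(), which a distributed launcher sets up but a plain single-GPU `python examples/run_xfun_re.py ...` run does not. A minimal sketch of the guard pattern for this situation (the -1 fallback follows the Transformers convention for non-distributed runs; the variable name is illustrative):

```python
import torch.distributed as dist

# In a plain single-GPU run (launched with `python`, not a distributed
# launcher), no default process group exists, so dist.get_rank() raises
# the RuntimeError shown in the logs above.
if dist.is_available() and dist.is_initialized():
    local_rank = dist.get_rank()  # real rank in a distributed run
else:
    local_rank = -1               # non-distributed run (Transformers convention)
```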

PritikaRamu commented 1 year ago

I'm facing the same issue: training works on 1 GPU, but the error occurs during evaluation.

hiijar commented 1 year ago

I got the same problem and fixed it by changing "self.args.local_rank = torch.distributed.get_rank()" to "self.args.local_rank = -1" (xfun_trainer.py, line 178).
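For reference, a guarded variant of that edit keeps multi-GPU runs working as well; this is only a sketch of what the assignment around line 178 of xfun_trainer.py could look like, not an official patch:

```python
# layoutlmft/trainers/xfun_trainer.py, inside evaluate() -- sketch, not the actual repo code
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    self.args.local_rank = dist.get_rank()  # distributed evaluation
else:
    self.args.local_rank = -1               # single-GPU / non-distributed evaluation
```

Hard-coding -1 as in the comment above also works for single-GPU setups, but it would break distributed evaluation, so the guarded form may be safer if you switch between setups.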