ChidanandKumarKS opened this issue 1 year ago
I'm facing the same issue. Training works on 1 GPU, but evaluation fails with this error.
I got the same problem and fixed it by changing "self.args.local_rank = torch.distributed.get_rank()" to "self.args.local_rank = -1" in xfun_trainer.py, line 178.
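If you still want multi-GPU evaluation to keep working, a slightly more defensive variant of that patch is possible. This is only a sketch, not code from the repo: the helper name resolve_local_rank is made up, and it simply falls back to -1 whenever no process group has been initialized.

```python
import torch.distributed as dist

def resolve_local_rank() -> int:
    """Return the current distributed rank, or -1 when torch.distributed is not set up."""
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return -1

# In layoutlmft/trainers/xfun_trainer.py, line 178 would then read:
# self.args.local_rank = resolve_local_rank()
```

With this guard, single-GPU runs behave like the hard-coded -1 workaround above, while runs launched under a distributed launcher still pick up the real rank.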
Describe the bug
Model I am using (UniLM, MiniLM, LayoutLM ...): LayoutLM (layoutlmft).
The problem arises when using: examples/run_xfun_re.py. Training runs on a single GPU, but evaluation (trainer.evaluate()) crashes.
To Reproduce
Steps to reproduce the behavior: run examples/run_xfun_re.py on a single GPU without a distributed launcher and let it reach the evaluation step.
Expected behavior
Evaluation should complete on a single GPU, just as training does.
Logs:
  File "examples/run_xfun_re.py", line 245, in <module>
    main()
  File "examples/run_xfun_re.py", line 230, in main
    metrics = trainer.evaluate()
  File "/home/chowkam/chowkamWkspc/unilm-master/layoutlmft/layoutlmft/trainers/xfun_trainer.py", line 178, in evaluate
    self.args.local_rank = torch.distributed.get_rank()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 822, in get_rank
    default_pg = _get_default_group()
  File "/home/chowkam/anaconda3/envs/chowkam/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 411, in _get_default_group
    "Default process group has not been initialized, "
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
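The message at the bottom points to another possible workaround: initialize a one-process group yourself before get_rank() is called. A minimal, self-contained sketch (the gloo backend and the address/port values here are illustrative assumptions, not something the repo prescribes):

```python
import os
import torch.distributed as dist

# Illustrative single-process setup so torch.distributed.get_rank() stops raising.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # assumed rendezvous address
os.environ.setdefault("MASTER_PORT", "29500")      # assumed free port
if dist.is_available() and not dist.is_initialized():
    dist.init_process_group(backend="gloo", rank=0, world_size=1)
print(dist.get_rank())  # prints 0 once the default group exists
```

Launching the script through torch.distributed.launch with a single process would set up the same environment automatically, but the xfun_trainer.py edit described in the comment above is the simpler fix for single-GPU evaluation.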