Is it because the model can only be trained on multiple GPUs? Or is the fix just to update train_splade.py to:

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size() if self.args.negatives_x_device else 1  # <- here
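A variant of the same guard that does not depend on the negatives_x_device flag is to ask torch.distributed directly whether a default process group exists. A minimal sketch, not Tevatron's actual code:

import torch.distributed as dist

# get_world_size() raises "Default process group has not been initialized"
# unless init_process_group() has been called, e.g. by a distributed launcher.
# Fall back to a world size of 1 for plain single-process runs.
world_size = dist.get_world_size() if dist.is_initialized() else 1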
Probably there is a bug. I'll take a look.
Hi, this bug is fixed in https://github.com/texttron/tevatron/pull/67
Thanks :)
Hi,
I'm trying to train a SPLADE model using the guidelines at https://github.com/texttron/tevatron/tree/main/examples/splade, but I am getting the following runtime error:
Traceback (most recent call last):
  File "/home/src/tevatron/examples/splade/train_splade.py", line 135, in <module>
    main()
  File "/home/src/tevatron/examples/splade/train_splade.py", line 116, in main
    trainer = SpladeTrainer(
  File "/home/src/tevatron/examples/splade/train_splade.py", line 31, in __init__
    self.world_size = torch.distributed.get_world_size()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.
Code snippet from train_splade.py (the line the traceback points at is the get_world_size() call):

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size()
Do you know why I am getting this error?
Thanks a lot in advance :)
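For context on the error itself: torch.distributed.get_world_size() only works after the default process group has been created, which a launcher such as torchrun arranges via environment variables. A minimal sketch of both cases (the file name and launcher invocation are illustrative):

# repro.py -- run with: torchrun --nproc_per_node=1 repro.py
import torch.distributed as dist

# Without init_process_group(), the next call raises the RuntimeError seen in
# the traceback above:
# dist.get_world_size()

# torchrun exports RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT; the default
# env:// init method reads them to create the default process group.
dist.init_process_group(backend="gloo")  # "nccl" for multi-GPU training
print(dist.get_world_size())  # one process launched, so this prints 1
dist.destroy_process_group()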