texttron / tevatron

Tevatron - A flexible toolkit for neural retrieval research and development.
http://tevatron.ai
Apache License 2.0

RuntimeError when training SPLADE - .get_world_size() issues #63

Closed lboesen closed 1 year ago

lboesen commented 1 year ago

Hi,

I'm trying to train a SPLADE model using the guidelines at https://github.com/texttron/tevatron/tree/main/examples/splade, but I am getting the following runtime error:

Traceback (most recent call last):
  File "/home/src/tevatron/examples/splade/train_splade.py", line 135, in <module>
    main()
  File "/home/src/tevatron/examples/splade/train_splade.py", line 116, in main
    trainer = SpladeTrainer(
  File "/home/src/tevatron/examples/splade/train_splade.py", line 31, in __init__
    self.world_size = torch.distributed.get_world_size()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 867, in get_world_size
    return _get_group_size(group)
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 325, in _get_group_size
    default_pg = _get_default_group()
  File "/home/anaconda3/envs/myenv/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

code snippet from train_splade.py:

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size()
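
For reference, the same RuntimeError shows up outside Tevatron too. A minimal standalone sketch, assuming a plain single-process run where torch.distributed.init_process_group() has never been called:

import torch

# No torch.distributed.init_process_group() has been called in this process.
print(torch.distributed.is_initialized())  # prints False

# This is the same call as line 31 of train_splade.py, and it fails with
# "RuntimeError: Default process group has not been initialized ..."
world_size = torch.distributed.get_world_size()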

Do you know why I am getting this error?

Thanks a lot in advance :)

lboesen commented 1 year ago

Is it because the model can only be trained on multiple GPUs? Or is the fix just to update train_splade.py to:

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        self.world_size = torch.distributed.get_world_size() if self.args.negatives_x_device else 1 # <-here 
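
(An alternative sketch, just a guess and not necessarily how the maintainers will fix it: guard on torch.distributed.is_initialized() instead, which also covers single-process runs where negatives_x_device happens to be set. Imports are the same as in train_splade.py.)

class SpladeTrainer(TevatronTrainer):
    def __init__(self, *args, **kwargs):
        super(SpladeTrainer, self).__init__(*args, **kwargs)
        # Only query the distributed world size when a default process
        # group actually exists; otherwise fall back to a single process.
        if torch.distributed.is_initialized():
            self.world_size = torch.distributed.get_world_size()
        else:
            self.world_size = 1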

MXueguang commented 1 year ago

There is probably a bug. I'll take a look.

MXueguang commented 1 year ago

Hi, this bug is fixed in https://github.com/texttron/tevatron/pull/67

lboesen commented 1 year ago

Thanks :)