huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Dropout sync across GPUs causes major performance drops #31412

Closed ri938 closed 1 week ago

ri938 commented 3 months ago

System Info

GPT2

torch==2.3.1

DDP

using transformers Trainer 4.41.2

Who can help?

@muellerzr @SunMarc (Trainer code)

@ArthurZucker and @younesbelkada (text models)

Information

Tasks

Reproduction

  1. When you set the random seed to a constant X, the dropout masks are identical across all devices being trained on with DDP:

import os

import torch
from transformers import set_seed

set_seed(40)

data = torch.ones_like(torch.randn(1, 10))
dropout = torch.nn.Dropout(p=0.5)

for _ in range(2):
    print({'device': os.environ.get('RANK'), 'dropout mask': dropout(data)})
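(For context: to see the multi-GPU behaviour, the snippet above presumably has to be launched once per GPU with a distributed launcher such as torchrun --nproc_per_node=4, which sets the RANK environment variable for each process.)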

On 1 GPU I get

{'device': '0', 'dropout mask': tensor([[0., 0., 2., 2., 2., 0., 2., 0., 2., 0.]])}
{'device': '0', 'dropout mask': tensor([[2., 0., 2., 2., 0., 2., 0., 2., 0., 0.]])}

On 4 GPUs I get the same dropout masks being applied on every device

{'device': '2', 'dropout mask': tensor([[0., 0., 2., 2., 2., 0., 2., 0., 2., 0.]])}
{'device': '2', 'dropout mask': tensor([[2., 0., 2., 2., 0., 2., 0., 2., 0., 0.]])}

{'device': '1', 'dropout mask': tensor([[0., 0., 2., 2., 2., 0., 2., 0., 2., 0.]])}
{'device': '1', 'dropout mask': tensor([[2., 0., 2., 2., 0., 2., 0., 2., 0., 0.]])}

{'device': '3', 'dropout mask': tensor([[0., 0., 2., 2., 2., 0., 2., 0., 2., 0.]])}
{'device': '3', 'dropout mask': tensor([[2., 0., 2., 2., 0., 2., 0., 2., 0., 0.]])}

{'device': '0', 'dropout mask': tensor([[0., 0., 2., 2., 2., 0., 2., 0., 2., 0.]])}
{'device': '0', 'dropout mask': tensor([[2., 0., 2., 2., 0., 2., 0., 2., 0., 0.]])}
  2. When training a model, this leads to poor performance. See the gap below, which scales with the number of GPUs; if I turn dropout off, there is no gap whatsoever. This is training on 1M samples and can be observed with any seed. You also see exploding gradient norms during training.

[W&B chart, 11_06_2024 22_58_50]

  3. I tried setting torch.manual_seed(seed + rank) so that the seed varies per device. For smaller models this fixes the issue, but it has a critical flaw: it breaks the data ordering and you get duplicate data across devices. In the logs below, the 'data' field is a hash of the batch being fed in (a simplified sketch of why this happens follows the logs).

With 1 GPU the data is unique

{'data': [1416901, 1311812, 1333073, 1410935], 'losses': tensor([[0.4966],
        [0.2563],
        [0.2142],
        [0.6520]], device='cuda:0', grad_fn=<NegBackward0>), 'rank': '0', 'local_rank': '0'}

{'data': [1109164, 1386701, 1297717, 989532], 'losses': tensor([[0.0630],
        [0.0800],
        [0.0653],
        [0.7069]], device='cuda:0', grad_fn=<NegBackward0>), 'rank': '0', 'local_rank': '0'}

With 4 GPUs you get duplicate data across devices (e.g. hash 1416901 is processed by both rank 0 and rank 1)

{'data': [1416901], 'losses': tensor([[0.8843]], device='cuda:1', grad_fn=<NegBackward0>), 'rank': '1', 'local_rank': '1'}
{'data': [1416901], 'losses': tensor([[0.5725]], device='cuda:0', grad_fn=<NegBackward0>), 'rank': '0', 'local_rank': '0'}
{'data': [1109164], 'losses': tensor([[0.0828]], device='cuda:3', grad_fn=<NegBackward0>), 'rank': '3', 'local_rank': '3'}
{'data': [1010993], 'losses': tensor([[0.2223]], device='cuda:2', grad_fn=<NegBackward0>), 'rank': '2', 'local_rank': '2'}

{'data': [1325519], 'losses': tensor([[0.7159]], device='cuda:1', grad_fn=<NegBackward0>), 'rank': '1', 'local_rank': '1'}
{'data': [1360317], 'losses': tensor([[0.4719]], device='cuda:2', grad_fn=<NegBackward0>), 'rank': '2', 'local_rank': '2'}
{'data': [1284812], 'losses': tensor([[2.2623]], device='cuda:3', grad_fn=<NegBackward0>), 'rank': '3', 'local_rank': '3'}
{'data': [1109164], 'losses': tensor([[0.0876]], device='cuda:0', grad_fn=<NegBackward0>), 'rank': '0', 'local_rank': '0'}
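To make the failure mode concrete, here is a minimal sketch. It is not the actual Trainer/accelerate sharding logic, just a simplified stand-in: each rank builds the full permutation from its own (per-rank-seeded) RNG state and then takes its slice, so the slices no longer partition the dataset.

import torch
from torch.utils.data import RandomSampler

world_size, n, seed = 4, 16, 100

def shard_for_rank(rank):
    # Simplified model of a sharded random sampler: build the full permutation
    # from the process-local RNG state, then take every world_size-th index.
    torch.manual_seed(seed + rank)  # per-rank seed, as in the workaround below
    perm = list(RandomSampler(range(n)))
    return perm[rank::world_size]

shards = [shard_for_rank(r) for r in range(world_size)]
flat = sorted(i for s in shards for i in s)
print(flat)  # some indices appear twice and others are missing entirely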

Expected behavior

There should be a way to set the random seed to control dropout without destroying the data ordering when doing DDP.

I am happy to submit an MR to fix this issue if given some pointers on how to implement it.

ri938 commented 3 months ago

It looks like the issue is that torch.manual_seed is used by both nn.Dropout and by the data loader.

ri938 commented 3 months ago

It seems that the data_seed argument is not used, but it should be possible to use it to set the seed for the random sampler.

Since we can only influence Dropout via torch.manual_seed, can I implement a change so that data_seed is used to seed the RandomSampler? Or would that not be backwards compatible, in which case I could add a new argument for this and deprecate the existing seeding behaviour?
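For illustration only (this is not the actual Trainer wiring, and data_seed here just stands in for the currently unused TrainingArguments field), a sampler driven by its own torch.Generator keeps the data order fixed regardless of what torch.manual_seed is later set to for dropout:

import torch
from torch.utils.data import DataLoader, RandomSampler

dataset = list(range(10))

data_seed = 100  # hypothetical: the value TrainingArguments(data_seed=...) would carry
gen = torch.Generator()
gen.manual_seed(data_seed)

sampler = RandomSampler(dataset, generator=gen)  # sampling no longer reads the global RNG
loader = DataLoader(dataset, sampler=sampler, batch_size=2)

torch.manual_seed(100 + 3)  # e.g. a per-rank dropout seed; does not affect `gen`
print([batch.tolist() for batch in loader])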

younesbelkada commented 3 months ago

Hi @ri938, thanks for this interesting issue. I am not really familiar with the way accelerate sets the seed for the data sampler. I am also not sure how you set both the seed for dropout and the seed for the sampler in your code; could you share more details about that?

ri938 commented 3 months ago

So I set the seed on startup to the same value (100) on each device:

def set_training_seed(seed):
    from transformers import set_seed
    set_seed(seed)

This ensures that each device has the same weight initialization before training starts.
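(For reference, transformers.set_seed seeds Python's random module, NumPy, and torch's CPU and CUDA generators, so every rank really does start from an identical RNG state, which is why the dropout masks above line up exactly.)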

Then I also set the seed in the TrainingArguments, which gets passed to the Trainer, to the same constant value of 100:

trainer_args = TrainingArguments(seed=100, **kwargs)

ri938 commented 3 months ago

The seed set via

torch.manual_seed(x)

is what affects dropout. It also affects the RandomSampler.

Therefore there is no way to make the dropout masks vary across devices without also breaking the data ordering, because keeping the ordering consistent requires the same seed to be set on every device.
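This coupling is easy to check in isolation; a minimal sketch, assuming torch's built-in RandomSampler (which draws from the global RNG when no generator is passed):

import torch
from torch.utils.data import RandomSampler

data = list(range(8))

torch.manual_seed(100)
order_a = list(RandomSampler(data))

torch.manual_seed(100)
order_b = list(RandomSampler(data))

torch.manual_seed(101)
order_c = list(RandomSampler(data))

print(order_a == order_b)  # True: same global seed, same sampling order
print(order_a == order_c)  # almost surely False: the order follows the global seed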

I would argue this is an issue potentially impacting many training runs for many users. Therefore there should be both a way to avoid it and a warning message or error so that people do not train unaware of it.

RUFFY-369 commented 3 months ago

it seems that the data_seed argument is not used but should be able to set the seed here for the random sampler

Hi @ri938, you are right: the class variable data_seed is not used, and set_seed is used for both data sampling and training. Please refer to the discussion in issue #31255.

ri938 commented 3 months ago

Yes, I was suggesting that if we used data_seed for the data sampling then this could be used to fix the issue, but it would break backwards compatibility.
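One way the compatibility concern could be handled (an assumption about a possible design, not a decided fix) is to fall back to seed when data_seed is unset, so existing runs keep their current ordering:

# Hypothetical fallback inside the sampler-seeding logic:
sampler_seed = args.data_seed if args.data_seed is not None else args.seed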

Here is another image to illustrate the problem. When training GPT-2, the gradient norms are huge when you use the same seed for each device, but when you vary the seed per device they are much more sensible.

[W&B chart, 12_06_2024 22_02_30]

ri938 commented 3 months ago

This is the workaround I am using to fix this issue.

I am adding a callback:


import os
import torch
from transformers import TrainerCallback


class SeedDeviceRandomlyCallback(TrainerCallback):

    def on_train_begin(self, args, state, control, **kwargs):
        # Re-seed each rank differently once training has begun.
        global_rank = int(os.environ['RANK'])
        new_seed = args.seed + global_rank
        print('Setting torch seed to {} on device {}'.format(new_seed, global_rank))
        torch.manual_seed(new_seed)

This is needed because the seed has to be changed only after get_train_dataloader has been called, in order to not break the data ordering.
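A minimal sketch of how the callback might be attached (model, train_dataset and the other arguments here are placeholders, not from the original report):

from transformers import Trainer, TrainingArguments

trainer = Trainer(
    model=model,                      # placeholder: your model
    args=TrainingArguments(output_dir="out", seed=100),
    train_dataset=train_dataset,      # placeholder: your dataset
    callbacks=[SeedDeviceRandomlyCallback()],
)
trainer.train()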

ri938 commented 3 months ago

After applying just this one callback, here is a demonstration of how much it improved performance:

[W&B chart, 14_06_2024 17_51_16]

[W&B chart, 14_06_2024 17_51_30]

ArthurZucker commented 3 months ago

Would be nice to have this merged then!

ArthurZucker commented 1 month ago

@ri938 do you want to open a PR with your proposed changes?

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.