microsoft / analysing_pii_leakage

The repository contains the code for analysing the leakage of personally identifiable information (PII) from the output of next-word-prediction language models.
MIT License
74 stars, 17 forks

No distributed training support for DP training? #8

Open vatsal-kr opened 10 months ago

vatsal-kr commented 10 months ago

Hello! When I run the code with two GPUs, I get the following error:

Traceback (most recent call last):
  File "/home/huntsman/repos/analysing_pii_leakage/examples/fine_tune.py", line 82, in <module>
    fine_tune(*parse_args())
  File "/home/huntsman/repos/analysing_pii_leakage/examples/fine_tune.py", line 74, in fine_tune
    lm.fine_tune(train_dataset, eval_dataset, train_args, privacy_args)
  File "/home/huntsman/repos/analysing_pii_leakage/src/pii_leakage/models/language_model.py", line 290, in fine_tune
    return self._fine_tune_dp(train_dataset, eval_dataset, train_args, privacy_args)
  File "/home/huntsman/repos/analysing_pii_leakage/src/pii_leakage/models/language_model.py", line 273, in _fine_tune_dp
    trainer.train()
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/dp_transformers/dp_utils.py", line 266, in training_step
    loss.backward()
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/functorch/_src/monkey_patching.py", line 77, in _backward
    return _old_backward(*args, **kwargs)
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py", line 310, in capture_backprops_hook
    activations, backprops = self.rearrange_grad_samples(
  File "/home/huntsman/anaconda3/envs/pii/lib/python3.10/site-packages/opacus/grad_sample/grad_sample_module.py", line 358, in rearrange_grad_samples
    raise ValueError(
ValueError: No activations detected for <class 'torch.nn.modules.linear.Linear'>, run forward after add_hooks(model)

This works fine with a single GPU though. Any suggestions?

s-zanella commented 7 months ago

How exactly are you running the fine_tune.py script for distributed training, and what type of distributed training do you want to achieve?

Opacus doesn't support model sharding with DeepSpeed or FSDP. It does support DDP, but the model would still need to fit on each individual GPU. Furthermore, the code needs to be adapted to use DDP through dp_transformers. See the args.parallel_mode check in https://github.com/microsoft/dp-transformers/blob/main/src/dp_transformers/dp_utils.py#L171.
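
For illustration, here is a minimal sketch of what Opacus-compatible DDP setup involves. This is not this repository's code; the GPT-2 checkpoint name, the script name, and the torchrun launch command are assumptions.

    # Minimal sketch (assumed file name: ddp_sketch.py).
    # Launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    import os

    import torch
    import torch.distributed as dist
    from opacus.distributed import DifferentiallyPrivateDistributedDataParallel as DPDDP
    from transformers import AutoModelForCausalLM

    # torchrun sets LOCAL_RANK for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Assumed checkpoint; the repository fine-tunes causal LMs such as GPT-2.
    model = AutoModelForCausalLM.from_pretrained("gpt2").to(local_rank)

    # Opacus requires its own DDP wrapper instead of
    # torch.nn.parallel.DistributedDataParallel, so that per-sample gradient
    # hooks run before gradients are synchronized across ranks.
    model = DPDDP(model)

The dp_transformers trainer performs this DPDDP wrapping internally when args.parallel_mode indicates distributed training, which is why launching through torchrun (rather than spawning two plain processes) is the expected path.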