pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Does Gradient Accumulation work with ghost clipping? #683

Closed · RobRomijnders closed this issue 4 weeks ago

RobRomijnders commented 4 weeks ago

Gradient accumulation and Ghost clipping

The DP optimizer for ghost clipping enforces accumulated_iterations = 1 here, and the gradients, after clipping and noising, are scaled here.

However, for ghost clipping, i.e. when DPOptimizerFastGradientClipping is used in combination with gradient accumulation, the gradients are only divided by expected_batch_size, not by the number of accumulated iterations. My question is therefore about documentation:

When using gradient accumulation, is the user expected to override DPOptimizerFastGradientClipping.accumulated_iterations themselves?
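
For concreteness, this is the accumulation pattern I have in mind (a minimal sketch; `model`, `optimizer`, `criterion` and `data_loader` are placeholders assumed to come from `PrivacyEngine.make_private(..., grad_sample_mode="ghost")`, and `accumulation_steps` is an illustrative value):

```python
# Minimal sketch of the gradient-accumulation pattern in question.
# model, optimizer, criterion, data_loader: placeholders from
# PrivacyEngine.make_private(..., grad_sample_mode="ghost").
accumulation_steps = 4  # illustrative

optimizer.zero_grad()
for step, (x, y) in enumerate(data_loader):
    loss = criterion(model(x), y)
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        # As far as I can tell, the step below divides the summed gradients by
        # expected_batch_size only, since accumulated_iterations is forced to 1.
        optimizer.step()
        optimizer.zero_grad()
```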

Thank you in advance, Rob

EnayatUllah commented 4 weeks ago

You are right -- gradient accumulation is currently not supported with Ghost Clipping. However, you can use the BatchMemoryManager to accumulate gradients over virtual mini-batches, which is supported with Ghost Clipping. Let us know if it addresses your requirements.
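
Roughly, the usage looks like this (a sketch with placeholders; `model`, `optimizer`, `criterion` and `train_loader` are assumed to come from the ghost clipping setup, and `max_physical_batch_size` is an illustrative value):

```python
from opacus.utils.batch_memory_manager import BatchMemoryManager

# Sketch: model, optimizer, criterion, train_loader assumed to come from
# PrivacyEngine.make_private(..., grad_sample_mode="ghost").
with BatchMemoryManager(
    data_loader=train_loader,
    max_physical_batch_size=16,  # illustrative
    optimizer=optimizer,
) as memory_safe_data_loader:
    for x, y in memory_safe_data_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        # A real (noised) step is only taken once a full logical batch has been
        # processed; intermediate calls are skipped internally.
        optimizer.step()
```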

RobRomijnders commented 3 weeks ago

Great, thank you for clarifying that. Perhaps it should be mentioned in the documentation?

One other question, just to double-check: does ghost clipping support distributed training, e.g. with DPDDP, or is that also unsupported?

EnayatUllah commented 3 weeks ago

We plan to support gradient accumulation with ghost clipping soon -- if you want to contribute a PR towards it, we would be happy to help.

And yes, ghost clipping is supported with DPDDP -- essentially the optimizer needs to be changed to DistributedDPOptimizerFastGradientClipping.
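
Roughly, the setup looks like this (illustrative sketch; `build_model()`, `train_loader` and the noise/clipping values are placeholders, and the exact make_private return signature should be checked against the ghost clipping tutorial / main):

```python
import torch
import torch.distributed as dist
from opacus import PrivacyEngine
from opacus.distributed import DifferentiallyPrivateDistributedDataParallel as DPDDP

dist.init_process_group(backend="nccl")  # usual torchrun/process-group setup assumed

model = DPDDP(build_model().cuda())      # build_model() is a placeholder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = torch.nn.CrossEntropyLoss()

privacy_engine = PrivacyEngine()
model, optimizer, criterion, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    criterion=criterion,
    data_loader=train_loader,            # a standard DataLoader (placeholder)
    noise_multiplier=1.0,                # illustrative values
    max_grad_norm=1.0,
    grad_sample_mode="ghost",
)
# With a DPDDP-wrapped module and grad_sample_mode="ghost", the optimizer should be
# a DistributedDPOptimizerFastGradientClipping (or can be swapped in manually).
```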