You are right -- gradient accumulation is currently not supported with Ghost Clipping. However, you can use `BatchMemoryManager` to accumulate gradients over virtual mini-batches, which is supported with Ghost Clipping. Let us know if that addresses your requirements.
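If it helps others who land here, a minimal sketch of that combination might look like the following (assuming a recent Opacus version where Ghost Clipping is requested via `grad_sample_mode="ghost"` and `make_private` also returns a wrapped criterion; the toy model, data, and batch sizes are placeholders):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from opacus import PrivacyEngine
from opacus.utils.batch_memory_manager import BatchMemoryManager

# Toy model and data so the snippet is self-contained.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(512, 16), torch.randint(0, 2, (512,)))
train_loader = DataLoader(dataset, batch_size=256)  # logical (DP) batch size

privacy_engine = PrivacyEngine()
# grad_sample_mode="ghost" requests Ghost Clipping; the criterion is wrapped
# so that loss.backward() can run the double backward pass it needs.
model, optimizer, criterion, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    criterion=criterion,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="ghost",
)

# BatchMemoryManager splits each logical batch of 256 into physical
# micro-batches of at most 64; the optimizer only performs a real (noised)
# step once a full logical batch has been processed.
with BatchMemoryManager(
    data_loader=train_loader,
    max_physical_batch_size=64,
    optimizer=optimizer,
) as memory_safe_loader:
    for x, y in memory_safe_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
```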
Great, thank you for clarifying that. Perhaps it should be mentioned in the documentation?
That said, one other question to double-check: does ghost clipping support distributed training, e.g. with DPDDP, or is that also unsupported?
We plan to support it soon - if you want to contribute a PR towards it, we would be happy to help.
And yes, ghost clipping is supported with DPDDP -- essentially the optimizer needs to be changed to DistributedDPOptimizerFastGradientClipping.
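For completeness, here is a rough sketch of that DPDDP path, under the assumption that `make_private` selects the distributed fast-gradient-clipping optimizer automatically when the module is already wrapped in DPDDP and `grad_sample_mode="ghost"` is passed (if it does not in your version, the optimizer can be swapped for `DistributedDPOptimizerFastGradientClipping` manually):

```python
import torch
import torch.distributed as dist
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from opacus import PrivacyEngine
from opacus.distributed import DifferentiallyPrivateDistributedDataParallel as DPDDP

# Assumes the process group has already been initialised, e.g. via torchrun:
# dist.init_process_group(backend="gloo")

model = DPDDP(nn.Linear(16, 2))  # wrap the module in DPDDP before make_private
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()
dataset = TensorDataset(torch.randn(512, 16), torch.randint(0, 2, (512,)))
train_loader = DataLoader(dataset, batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, criterion, train_loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    criterion=criterion,
    data_loader=train_loader,
    noise_multiplier=1.0,
    max_grad_norm=1.0,
    grad_sample_mode="ghost",
)

# Expectation (assumption, not verified here): the returned optimizer is the
# distributed variant mentioned above.
print(type(optimizer).__name__)  # e.g. "DistributedDPOptimizerFastGradientClipping"
```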
Gradient accumulation and Ghost clipping
The DP optimizer for Ghost clipping enforces `accumulated_iterations = 1` here. The gradients, after clipping and noising, are scaled here. However, for ghost clipping, i.e. when `DPOptimizerFastGradientClipping` is used in combination with gradient accumulation, this will only divide by `expected_batch_size`.

Therefore, my question regards documentation: when the user uses gradient accumulation, are they expected to overwrite `DPOptimizerFastGradientClipping.accumulated_iterations` themselves?

Thank you in advance, Rob
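To make the scaling concern concrete, the behaviour described above amounts to roughly the following (a paraphrase for illustration, assuming `loss_reduction="mean"`; this is not the actual Opacus source):

```python
# Illustration only: how summed, clipped-and-noised gradients get rescaled.

def scale_grad_dp_optimizer(grad, expected_batch_size, accumulated_iterations):
    # Base DPOptimizer: the divisor covers the full logical batch, i.e.
    # expected_batch_size * accumulated_iterations.
    return grad / (expected_batch_size * accumulated_iterations)

def scale_grad_ghost_clipping(grad, expected_batch_size):
    # DPOptimizerFastGradientClipping: accumulated_iterations is pinned to 1,
    # so with manual gradient accumulation over k micro-batches the result
    # would be too large by a factor of k unless the user compensates.
    return grad / expected_batch_size
```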