Open · OsvaldFrisk opened this issue 3 years ago
The relation of the clip to the actual gradients' magnitudes is an important one, but there are two slight problems with the algorithm you are describing.
First, we can't use a different clip for each gradient in the batch -- the privacy guarantee comes from the ratio of the noise to the worst-case sensitivity. So if one gradient has norm 0.5 and another has norm 1, we still have to clip both to the same norm and add noise proportional to that.
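To make that concrete, here is a bare-bones numpy sketch of the aggregation step being described (not the library's actual code): every per-example gradient is clipped to the same bound C, and the noise is calibrated to C, not to the individual norms.

```python
import numpy as np

def dp_sum(per_example_grads, C, noise_multiplier, rng):
    """Clip every per-example gradient to the SAME L2 norm C, sum them, and add
    Gaussian noise with std. dev. noise_multiplier * C (the worst-case sensitivity)."""
    clipped = [g * min(1.0, C / np.linalg.norm(g)) for g in per_example_grads]
    total = np.sum(clipped, axis=0)
    return total + rng.normal(0.0, noise_multiplier * C, size=total.shape)

rng = np.random.default_rng(0)
grads = [np.array([0.3, 0.4]),   # norm 0.5: unchanged by clipping
         np.array([0.6, 0.8])]   # norm 1.0: exactly at the bound
# Even though one gradient only has norm 0.5, the noise is still proportional
# to C = 1.0, because the guarantee must cover the worst case a record could contribute.
noisy_sum = dp_sum(grads, C=1.0, noise_multiplier=1.0, rng=rng)
```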
Second, the magnitudes of the gradients are themselves private quantities, so we can't just clip everything to, say, the maximum gradient norm over all gradients in the batch -- that would require inspecting the gradient norms, which is not allowed. Any property of the data that influences the final result, even in a very indirect way like this, must be estimated privately.
Fortunately, there is a solution implemented in TensorFlow Privacy called quantile-based adaptive clipping, described by Andrew et al. (2021). It is implemented as QuantileAdaptiveClipSumQuery.
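For illustration, here is a minimal sketch of how such a query might be constructed. The module path and argument names follow my reading of the tensorflow_privacy source and may differ between versions; the specific values are just placeholders.

```python
from tensorflow_privacy.privacy.dp_query.quantile_adaptive_clip_sum_query import (
    QuantileAdaptiveClipSumQuery)

# Quantile-based adaptive clipping (Andrew et al., 2021): the clipping norm is
# updated each round so that roughly `target_unclipped_quantile` of the
# per-example gradients fall below it. The noise std. dev. is always
# noise_multiplier * current_clip, so the noise multiplier z stays fixed while
# the clip tracks the actual gradient magnitudes.
query = QuantileAdaptiveClipSumQuery(
    initial_l2_norm_clip=1.0,        # starting value of the clipping norm B
    noise_multiplier=1.0,            # z = sigma / B, held constant as B adapts
    target_unclipped_quantile=0.5,   # aim for ~50% of gradients left unclipped
    learning_rate=0.2,               # step size of the clip update
    clipped_count_stddev=0.5,        # noise on the privately estimated clipped fraction
    expected_num_records=256,        # expected number of records per step
    geometric_update=True)           # geometric update rule from the paper

# The query can then be passed to a DP optimizer that accepts a dp_sum_query
# (e.g. the optimizers produced by dp_optimizer.make_optimizer_class).
```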
Hello! I have a question about gradient clipping that arises from the following principles of privacy accounting and DP-SGD: the RDP calculation for each step in training is based on the ratio of the std. deviation of the noise to the maximum norm bound of the gradients. This ratio is known as the noise multiplier. As long as the ratio stays the same, the privacy guarantee for a given real-valued function does not change. So if I want to increase the maximum norm bound (sensitivity) of a real-valued function, the noise std. dev. just has to be scaled by the same amount to satisfy the same privacy guarantee (see also https://github.com/pytorch/opacus/issues/11 and Proposition 7 / Corollary 3 in https://arxiv.org/pdf/1702.07476.pdf).
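As a quick sanity check of that claim, the RDP accountant never sees B or sigma individually, only their ratio (plus the sampling rate and the number of steps). A small sketch using the older rdp_accountant module (newer releases moved this functionality into the dp_accounting package):

```python
from tensorflow_privacy.privacy.analysis.rdp_accountant import (
    compute_rdp, get_privacy_spent)

orders = [1 + x / 10.0 for x in range(1, 100)] + list(range(12, 64))
q, steps, delta = 0.01, 1000, 1e-5

# B = 1, sigma = 1 and B = 4, sigma = 4 both give z = sigma / B = 1, so they
# are passed to the accountant identically and yield the same epsilon.
rdp = compute_rdp(q=q, noise_multiplier=1.0, steps=steps, orders=orders)
eps, _, opt_order = get_privacy_spent(orders, rdp, target_delta=delta)
print(eps, opt_order)
```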
Given this, I want to discuss the following example. Suppose (for the sake of simplicity) that I have chosen a norm bound of B = 1, and that the corresponding noise std. dev. sigma is also 1. The noise multiplier is z = sigma / B = 1, and this real-valued function then satisfies (alpha, alpha / (2 * z^2))-RDP. Consider then the following two cases during training:

1. The gradients in the batch have norms at or above the bound B = 1, so clipping is active and the noise std. dev. of 1 is on the same scale as the gradients themselves.
2. The gradients in the batch have norms well below B = 1 (say around 0.1), so clipping changes nothing, yet the noise std. dev. is still 1, i.e. much larger relative to the gradients' actual size.
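To make the two cases concrete, a toy numpy sketch (the norms 1.0 and 0.1 are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
B, sigma = 1.0, 1.0   # clipping norm and noise std. dev., so z = sigma / B = 1

def clip_and_noise(grad):
    """DP-SGD style step for one gradient: clip to L2 norm B, add N(0, sigma^2) noise."""
    clipped = grad * min(1.0, B / np.linalg.norm(grad))
    return clipped + rng.normal(0.0, sigma, size=grad.shape)

large = np.full(10, 1.0 / np.sqrt(10))   # case 1: norm 1.0, right at the bound
small = 0.1 * large                      # case 2: norm 0.1, well below the bound

for g in (large, small):
    err = np.linalg.norm(clip_and_noise(g) - g)
    print(f"grad norm {np.linalg.norm(g):.1f}: relative distortion {err / np.linalg.norm(g):.1f}")
# The absolute noise is identical in both cases, so the relative distortion of
# the ten-times-smaller gradient is roughly ten times larger.
```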
My question now is: is there an obvious reason why this is not considered in privacy accounting? It does not seem to me that the accountant takes the actual scale of the gradients into account or scales the noise accordingly. The clipping threshold and noise multiplier are constant hyperparameters that are freely chosen by the user of tensorflow_privacy. Because they are constant, the noise added to the gradients is also constant. As the sizes of the gradients most definitely are not, this leads me to believe that training sometimes lands in the first case, for which the noise is correctly scaled, and at other times in the second case listed above, for which the noise is not accurate (or rather always pessimistic), so that we add more noise than a given guarantee requires, hurting the model's utility.
Could you address this concern and whether it is possible to mitigate this using something like an adaptive clipping bound/noise during training? Or is there something I am missing? Thanks in advance!