google-research / federated

A collection of Google research projects related to Federated Learning and Federated Analytics.
Apache License 2.0

[differential_privacy] Learning rates used for Adaptive Clipping experiments #59

Open VasundharaAgarwal opened 2 years ago

VasundharaAgarwal commented 2 years ago

Hi,

I am trying to reproduce the experiments in "Differentially Private Learning with Adaptive Clipping" (2021), the source code for which is provided under federated/differential_privacy. The paper does not report the final server learning rates used for DP-FedAvgM with clipping enabled. It simply states the following in Section 3.1:

> Therefore, for all approaches with clipping—fixed or adaptive—we search over a small grid of five server learning rates, scaling the values in Table 1 by {1, 10^1/4, 10^1/2, 10^3/4, 10}. For all configurations, we report the best performing model whose server learning rate was chosen from this small grid on the validation set.
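For reference, the five-point grid described in that excerpt can be sketched as follows. This is only an illustration; `base_slr` is a placeholder for the per-task value from Table 1, not a value from the repository code.

```python
# Sketch of the server learning rate (SLR) grid from Section 3.1:
# the base SLR from Table 1 is scaled by {1, 10^(1/4), 10^(1/2), 10^(3/4), 10}.
base_slr = 1.0  # placeholder: substitute the Table 1 value for a given task
scales = [10 ** (k / 4) for k in range(5)]  # 1, 10^0.25, 10^0.5, 10^0.75, 10
slr_grid = [base_slr * s for s in scales]
print(slr_grid)
```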

It is not computationally feasible for me to search for the optimal server learning rate in every possible configuration, so I was hoping you could specify the learning rates that were used for training the best performing models. Thank you.

kairouzp commented 2 years ago

Using a server learning rate = 1 should be just as good as anything else. I believe that's what we observed.

VasundharaAgarwal commented 2 years ago

> Using a server learning rate = 1 should be just as good as anything else. I believe that's what we observed.

Thanks! Should I use that value for all clipping quantiles and noise multipliers then?

kairouzp commented 2 years ago

My impression is that yes, this should work well for all noise multipliers and clipping quantiles. I will check with the paper authors and get back to you on this one, but that's what I have observed in my own experiments.

VasundharaAgarwal commented 2 years ago

> My impression is that yes, this should work well for all noise multipliers and clipping quantiles. I will check with the paper authors and get back to you on this one, but that's what I have observed in my own experiments.

Thanks, that would be great! :)

galenmandrew commented 2 years ago

Hello. Peter is correct that a server learning rate of 1 is generally fine, and you shouldn't expect significant gains from optimizing it. However, in the paper we did experiment with different learning rates to account for the impact of clipping. I can provide the optimal values we used here.

For each task, I give the optimal server learning rate (SLR) with fixed clipping and with adaptive clipping to the median. The values below are the log base 10 of the SLR chosen on the development set. Note that for fixed clipping I am giving you the optimal SLR with the best fixed clip C* as shown in Figure 7, while for adaptive clipping I am giving you the optimal SLR with clipping to the median. (Different fixed clips would have different optimal SLRs.)

| Task | Fixed (log10 SLR) | Adaptive (log10 SLR) |
|---|---|---|
| CIFAR-100 | -0.25 | -0.5 |
| EMNIST-CR | 0.25 | 0.0 |
| EMNIST-AE | 0.5 | 0.5 |
| SHAKESPEARE | -0.25 | -0.5 |
| SO-NWP | 1.0 | 0.5 |
| SO-LR | 0.25 | 0.25 |
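Since these are log base 10 values, the actual server learning rates are recovered by exponentiating. A small sketch (the dictionary keys are illustrative labels, not identifiers from the repository):

```python
# Reported values are log10 of the chosen server learning rate (SLR);
# the SLR itself is 10 raised to that value.
log10_slr = {
    ("CIFAR-100", "fixed"): -0.25,   ("CIFAR-100", "adaptive"): -0.5,
    ("EMNIST-CR", "fixed"): 0.25,    ("EMNIST-CR", "adaptive"): 0.0,
    ("EMNIST-AE", "fixed"): 0.5,     ("EMNIST-AE", "adaptive"): 0.5,
    ("SHAKESPEARE", "fixed"): -0.25, ("SHAKESPEARE", "adaptive"): -0.5,
    ("SO-NWP", "fixed"): 1.0,        ("SO-NWP", "adaptive"): 0.5,
    ("SO-LR", "fixed"): 0.25,        ("SO-LR", "adaptive"): 0.25,
}
slr = {k: 10 ** v for k, v in log10_slr.items()}
```

Note that a reported value of 0.0 corresponds to an SLR of 10^0 = 1, not a learning rate of zero.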

Hope that helps.

VasundharaAgarwal commented 2 years ago

Thank you so much @galenmandrew, that's very helpful.

I'm not sure I understand how a learning rate of 0.0 for EMNIST-CR adaptive would work. Surely the model won't get updated?