Hi Zhifeng Jiang,
Thanks for bringing this to our attention. This is weird. The numbers up until bits = 17 make sense to us. Let us try to reproduce the error you are seeing and figure out what is happening. My guess is that some of the scipy optimizers we use under the hood are failing to converge -- notice how the scale (which should be >> 1) is much smaller than 1. We will try to look into it soon and get back to you. In the meantime, feel free to play with the range of the values for the scipy optimizers (to see if it becomes more stable).
Thanks, peter
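(A minimal illustration of the failure mode described above: a hypothetical guard, not part of the library, that flags results where the returned `scale` collapses below 1.)

```python
def check_skellam_params(scale, local_stddev):
    """Hypothetical sanity check: a healthy solution should have scale >> 1."""
    if scale < 1.0:
        raise ValueError(
            f"skellam_params likely failed to converge "
            f"(scale={scale:.3g}, local_stddev={local_stddev:.3g}); "
            "try playing with the optimizer value ranges or a larger bitwidth.")
    return scale, local_stddev
```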
On Mon, Oct 10, 2022 at 10:13 AM Zhifeng Jiang @.***> wrote:
To whom it may concern,
When using the following interface to calculate the `local_stddev` for DSkellam (https://github.com/google-research/federated/blob/76aa6f665248b8a2e5b07cc52a68b4c4b94fd06d/distributed_dp/accounting_utils.py#L570),
I obtained different `(scale, local_stddev)` pairs for different configurations. What I found strange is that when I swept `bits` from 20 down to 16 while keeping everything else the same, the results:
- first change smoothly when `bits >= 17`;
- then, at `bits = 16`, both the `scale` and the `local_stddev` change drastically, implying that they are no longer useful for actual training (e.g., the `local_stddev` climbs to over a hundred thousand).
For your convenience, I also made a table of the results:

| bits | 20 | 19 | 18 | 17 | 16 |
| --- | --- | --- | --- | --- | --- |
| local_stddev | 5.23 | 5.29 | 5.52 | 6.88 | 162582.83 |
Please refer to the appended execution results for more details. It seems that something in the optimization process breaks when `bits` is too small. My questions:
- Has anyone encountered a similar issue before?
- Does this indicate a limitation of DSkellam, e.g., that it is not well suited to low-bit quantization? (Though I would not consider 16 bits a typical "low-bit" scenario.)
Thanks for your attention, and I am really looking forward to your help!
Detailed execution results.
```python
skellam_params(epsilon=6, l2_clip=3, bits=20, num_clients=16, dim=16777216,
               q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
# (8348.494662007102, 5.233357786360573)

skellam_params(epsilon=6, l2_clip=3, bits=19, num_clients=16, dim=16777216,
               q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
# (4132.065287771359, 5.286782384491639)

skellam_params(epsilon=6, l2_clip=3, bits=18, num_clients=16, dim=16777216,
               q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
# (1979.514945816654, 5.517849242526566)

skellam_params(epsilon=6, l2_clip=3, bits=17, num_clients=16, dim=16777216,
               q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
# (793.6147645649165, 6.881591755490655)

skellam_params(epsilon=6, l2_clip=3, bits=16, num_clients=16, dim=16777216,
               q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
# (0.02190604694235002, 162582.82632490725)
```
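For reference, the sweep above boils down to a loop of the following form (a sketch; the `distributed_dp.accounting_utils` import path and the printing are assumptions, not part of the original report):

```python
import numpy as np
from distributed_dp.accounting_utils import skellam_params  # import path is an assumption

for bits in (20, 19, 18, 17, 16):
    scale, local_stddev = skellam_params(
        epsilon=6, l2_clip=3, bits=bits, num_clients=16, dim=16777216,
        q=1.0, steps=150, beta=np.exp(-0.5), delta=0.01)
    print(f"bits={bits}: scale={scale:.2f}, local_stddev={local_stddev:.2f}")
```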
Hi @SamuelGong,
Thanks for your interest in our work! Just adding to @kairouzp above, a short answer is basically that our heuristic to balance quantisation error & modulo clipping error (what `skellam_params` is doing) determines that the aggregate at the server won't fit into the bitwidth (16) for the last set of configs, so it is giving you this boundary behaviour with a very small `scale` and a meaningless `local_stddev`. Since `skellam_params` simply tries to find a `scale` automatically, you can also manually specify a `scale` for `bits=16` for SecAgg and get a reasonable `local_stddev` like so:
```python
skellam_local_stddev(epsilon=6,
                     scale=350,  # `scale`
                     l2_clip=3,
                     num_clients=16,
                     beta=np.exp(-0.5),
                     dim=16777216,
                     q=1,
                     steps=150,
                     delta=0.01,
                     orders=RDP_ORDERS)

11.43585395098052  # `local_stddev`
```
(compare the above to your `bits=17` output)
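If it helps, one way to hand-pick a workable `scale` is to sweep a few candidate values and inspect the resulting `local_stddev` (a sketch reusing the parameters above; the candidate values are arbitrary, and `skellam_local_stddev`, `RDP_ORDERS`, and `np` are assumed to be in scope as in the snippet above):

```python
for scale in (50, 100, 200, 350, 500):  # arbitrary candidates
    stddev = skellam_local_stddev(epsilon=6, scale=scale, l2_clip=3,
                                  num_clients=16, beta=np.exp(-0.5),
                                  dim=16777216, q=1, steps=150,
                                  delta=0.01, orders=RDP_ORDERS)
    print(f"scale={scale}: local_stddev={stddev:.2f}")
```

Note that, unlike `skellam_params`, this does not check whether the aggregate still fits into `bits=16`, so the larger candidates still need to respect the SecAgg range.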
More specifically, the heuristic `skellam_params` works as follows: for a given bitwidth (say `bits=16`), it tries to select `scale` and `local_stddev` such that we can fit roughly `k=3` stddevs of the aggregate signal inside the range `[0, 2^bits)`. This “aggregate signal” contains the sum of (1) the scaled client updates and (2) their locally added (scaled) noise. This means that the `k=3` stddevs of the aggregate might not fit if you choose a setting with, say, a large clipping bound or strong privacy (e.g. no client subsampling). We also approximate the stddev of this aggregate in a loose way (see the “Shared scale” line of Algorithm 1 of the paper) due to the loose norm bound on (1) that gets worse with larger `dim`, meaning that it could be overestimating the bitwidth that the aggregate needs. To “make more room” for the aggregate (thus allowing a larger `scale` and a useful `local_stddev`), you can manually adjust the quantization/modular clipping tradeoff by using a smaller `k` like so:
```python
skellam_params(epsilon=6,
               l2_clip=3,
               bits=16,
               num_clients=16,
               dim=16777216,
               q=1.0,
               steps=150,
               beta=np.exp(-0.5),
               k=2,  # note here
               delta=0.01)

(387.69384343074285, 10.56503703173974)
```
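To make the fit condition above concrete, here is a back-of-envelope illustration (my own rough sketch, not the exact bound that `skellam_params` uses; the per-coordinate signal estimate in particular is a simplifying assumption):

```python
import math

def roughly_fits(scale, local_stddev, l2_clip, num_clients, dim, bits, k=3):
    """Rough check: do ~k 'stddevs' of the aggregate fit into [0, 2**bits)?"""
    # (1) Scaled, clipped client updates: a vector with l2 norm `l2_clip` has a
    #     per-coordinate magnitude on the order of l2_clip / sqrt(dim) after a
    #     random rotation; `num_clients` of them are summed.
    signal_spread = num_clients * scale * l2_clip / math.sqrt(dim)
    # (2) Scaled local noise: per-client stddev ~ scale * local_stddev, growing
    #     as sqrt(num_clients) when summed across clients.
    noise_spread = math.sqrt(num_clients) * scale * local_stddev
    return k * (signal_spread + noise_spread) < 2 ** bits

# e.g. plug in the k=2 numbers returned above:
print(roughly_fits(scale=387.69, local_stddev=10.57, l2_clip=3,
                   num_clients=16, dim=16777216, bits=16, k=2))
```

Both terms grow linearly in `scale`, which is why a too-small `bits` forces the heuristic into a very small `scale` (and hence the boundary behaviour above).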
Choosing a smaller `l2_clip`, `dim`, `steps`, or `q` should also work, as they all determine the “size” of the aggregate (either in terms of component (1) or (2) above).
If having small bitwidths is important, check out an extension to this work (https://arxiv.org/pdf/2203.03761.pdf) which considers using sparse linear projections.
I hope this helps!
Thanks to @kairouzp and @kenziyuliu for the informative replies, which tell me that:
(1) the issue happens because the variance of the aggregated signal cannot be bounded within SecAgg's field size, and thus
(2) using a reasonably larger `num_bits` can fix the problem by enlarging SecAgg's field size;
(3) alternatively, the problem can also be worked around by using a smaller `k`/`l2_clip`/`dim`/`steps`/`q` so that the variance of the aggregated signal is reduced;
(4) moreover, if I want to keep all of these unchanged, I can resort to manually setting the `scale` and using `skellam_local_stddev` instead of `skellam_params`, which fails to obtain a meaningful `scale`.
Appreciate that!