pytorch / opacus

Training PyTorch models with differential privacy
https://opacus.ai
Apache License 2.0

Privacy Leakage at low sample size #571

Closed: tudorcebere closed this issue 2 months ago

tudorcebere commented 1 year ago

🐛 Bug

When using Opacus at very low sample sizes (~2-3 samples), I managed to leak more privacy than the accountant reports:

Link: https://colab.research.google.com/drive/1gZVrg9kPIWjibApBkEnKNQqaIn8kUySs?usp=sharing

The privacy estimation follows the method in: https://proceedings.neurips.cc/paper/2020/file/fc4ddc15f9f4b4b06ef7844d6bb53abf-Paper.pdf

The idea is as follows (a rough code sketch appears after the list):

  1. Craft some worst-case neighbouring datasets D and D'
  2. Flip a coin b and run your mechanism (a linear regression trained via DP-SGD in this case) on the corresponding dataset
  3. The adversary outputs a score (in this case, a model dimension that behaves close to worst-case)
  4. Select the best threshold the adversary could pick to correctly guess the value of b
  5. Estimate the privacy as in the paper above.
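For concreteness, here is a minimal sketch of the procedure (my own simplification, not the Colab code). It assumes the Opacus 1.x make_private API, a toy one-weight linear model, hypothetical worst-case datasets, the learned weight as the adversary's score, and a plain FPR/FNR point estimate of epsilon without the Clopper-Pearson confidence intervals used in the paper:

```python
import math

import torch
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine


def train_once(dataset, noise_multiplier=1.0, max_grad_norm=1.0):
    # Step 2: one run of the mechanism (DP-SGD on a one-weight linear model).
    raw_model = torch.nn.Linear(1, 1, bias=False)
    torch.nn.init.zeros_(raw_model.weight)  # fixed init reduces attack variance
    optimizer = torch.optim.SGD(raw_model.parameters(), lr=0.1)
    loader = DataLoader(dataset, batch_size=len(dataset))
    model, optimizer, loader = PrivacyEngine().make_private(
        module=raw_model,
        optimizer=optimizer,
        data_loader=loader,
        noise_multiplier=noise_multiplier,
        max_grad_norm=max_grad_norm,
    )
    criterion = torch.nn.MSELoss()
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    # Step 3: the adversary's score is the single learned weight
    # (raw_model shares its parameters with the wrapped model).
    return raw_model.weight.item()


# Step 1: worst-case neighbouring datasets (D' = D plus one extreme point).
D = TensorDataset(torch.zeros(2, 1), torch.zeros(2, 1))
D_prime = TensorDataset(torch.tensor([[0.0], [0.0], [1.0]]),
                        torch.tensor([[0.0], [0.0], [1.0]]))

# Step 2, repeated: flip a coin b and train on the corresponding dataset.
bits, scores = [], []
for _ in range(200):  # more trials give a tighter estimate
    b = torch.randint(0, 2, (1,)).item()
    bits.append(b)
    scores.append(train_once(D_prime if b else D))


# Steps 4-5: best threshold, then eps >= log((1 - delta - FPR) / FNR).
def empirical_eps(scores, bits, delta=1e-5):
    best = 0.0
    for t in sorted(scores):
        fp = sum(s >= t and b == 0 for s, b in zip(scores, bits))
        tn = sum(s < t and b == 0 for s, b in zip(scores, bits))
        fn = sum(s < t and b == 1 for s, b in zip(scores, bits))
        tp = sum(s >= t and b == 1 for s, b in zip(scores, bits))
        fpr = fp / max(fp + tn, 1)
        fnr = fn / max(tp + fn, 1)
        for num, den in ((1 - delta - fpr, fnr), (1 - delta - fnr, fpr)):
            if num > 0 and den > 0:
                best = max(best, math.log(num / den))
    return best


print("empirical epsilon lower bound:", empirical_eps(scores, bits))
```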

When the cardinality is high, the privacy bound holds, but at a low sample size the guarantees get violated (check the last print: the epsilon reported by Opacus is ~1.2, while the attack empirically measures ~2.5).

I shared this with @alexandresablayrolles and he mentioned that the problem might be that, in this scenario, Opacus leaks the cardinality of the underlying dataset/batch (which is private).

I am curious to hear further thoughts/feedback, and happy to help patch this.

alexandresablayrolles commented 1 year ago

Thanks for posting the issue. As we discussed offline with @tudorcebere, the issue is the following: Opacus estimates the expected_batch_size by multiplying sample_rate and dataset_size, which means that the dataset_size, a quantity that is never privatized, leaks into the optimizer.

This is not a problem for large datasets as both D and D' will have similar dataset_size, but for small datasets the difference becomes apparent.
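To make the leak concrete, here is a paraphrase of the relevant logic (not a verbatim copy of the Opacus source):

```python
# With Poisson sampling, the DP optimizer needs an expected batch size to
# average the noisy gradient sum, and that value is derived from len(dataset):
sample_rate = 1 / len(data_loader)                          # = batch_size / N
expected_batch_size = int(len(data_loader.dataset) * sample_rate)
# The exact value of N = len(dataset) therefore influences every optimizer
# step; for N = 2 vs N = 3 the updates differ noticeably, which is what the
# attack above detects.
```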

I would say a short-term solution might be to just print a warning if dataset_size is too small (say, lower than 10). Any thoughts, @ffuuugor?
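A rough sketch of what such a warning could look like (hypothetical, not actual Opacus code; the threshold of 10 is the one suggested above):

```python
import warnings

MIN_DATASET_SIZE = 10  # illustrative threshold

if len(data_loader.dataset) < MIN_DATASET_SIZE:
    warnings.warn(
        "The dataset is very small; expected_batch_size is derived from "
        "len(dataset), which may leak more privacy than the accountant reports."
    )
```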

ffuuugor commented 1 year ago

huh, interesting

Agree on the short-term approach; it doesn't look like a big deal for most scenarios, but it'd be nice to cover the extreme cases too.

What would a proper long-term solution look like? Add noise to the dataset_size? Switch to the replacement DP definition?

alexandresablayrolles commented 1 year ago

> Switch to replacement DP

That would solve this problem but we would need to essentially double the noise since the sensitivity is now 2 (the diameter of the sphere instead of its radius).
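Spelling that out (my notation, not from the thread): with per-sample clipping norm C,

$$
\Delta_{\text{add/remove}} = C, \qquad \Delta_{\text{replace-one}} = 2C,
$$

so matching the same $(\varepsilon, \delta)$ with the Gaussian mechanism requires $\sigma_{\text{replace}} = 2\,\sigma_{\text{add/remove}}$, since the noise standard deviation scales linearly with the sensitivity.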

> Add noise to the dataset_size

That is the proper solution IMO. It is a bit annoying because users don't necessarily expect this, and it means that we have to take some budget off the training budget to estimate the dataset size.

One possible implementation is to ask for n_data in make_private. Users can either give a numeric value, or pass n_data="estimate" to get a noisy estimate (with, say, sigma=20 so that it is negligible w.r.t. the overall budget but still accounted for); the default, n_data="auto", takes the length of the dataset (as currently done). When n_data="auto" we issue a warning that it can leak some privacy, and raise an error if secure_mode is activated.
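A hypothetical sketch of the n_data="estimate" path (the helper name and details are mine, not an actual Opacus API):

```python
import torch


def estimate_dataset_size(dataset, sigma: float = 20.0) -> int:
    """Gaussian-noised count of len(dataset); this release would have to be
    charged to the privacy accountant alongside the training budget."""
    noisy_n = len(dataset) + torch.normal(mean=0.0, std=sigma, size=(1,)).item()
    return max(1, int(round(noisy_n)))


# e.g., inside make_private (hypothetical):
#   expected_batch_size = int(estimate_dataset_size(data_loader.dataset) * sample_rate)
```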

tudorcebere commented 1 year ago

If everyone is fine with this fix, I can try to implement it.

Just as a remark: we assume that at big cardinality this is not relevant, but I think it depends on the sigma too, right? If sigma is very small, this can still affect the results.

As an alternative fix: the gradient is computed via a noisy sum. Why can't we do a noisy mean instead? (We wouldn't release the cardinality, and the mechanism would scale the noise according to the mean rather than the sum.)

karthikprasad commented 1 year ago

> One possible implementation is to ask for n_data in make_private.

I like this direction, but how about reusing the secure_mode flag instead of introducing a new arg? Printing a warning in most cases should be fine (we need a detailed answer on the FAQ page so as not to spook non-expert users), and we switch to the "estimate" approach in secure_mode.
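A hypothetical sketch of that behaviour, reusing the estimate_dataset_size helper sketched above (not actual Opacus code):

```python
import warnings

if secure_mode:
    # privately estimated count, charged to the budget
    n_data = estimate_dataset_size(data_loader.dataset)
else:
    n_data = len(data_loader.dataset)
    warnings.warn(
        "expected_batch_size is derived from the exact len(dataset); for very "
        "small datasets this can leak extra privacy (see FAQ)."
    )
expected_batch_size = int(n_data * sample_rate)
```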

alexandresablayrolles commented 1 year ago

Just realized that another option is to simply set expected_batch_size to the data loader's batch size? Seems easier and more natural.
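i.e., something along these lines (hypothetical):

```python
# Take the expected batch size from the user-supplied loader instead of
# deriving it from the (private) dataset length.
expected_batch_size = data_loader.batch_size  # user-chosen, data-independent
```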