opendp / smartnoise-core

Differential privacy validator and runtime
MIT License

Whitepaper bug/code clarification: Derivation of the sensitivity of the mean #355

Closed · raprasad closed this issue 2 years ago

raprasad commented 3 years ago
dkifer commented 3 years ago

I closed issue 356 (same topic) and put the comments here. Possibly all of the docs should be checked because the bug is fairly generic. Comments from issue 356:

It appears that the sensitivity calculations in this document (and possibly others) are incorrect: https://github.com/opendp/smartnoise-core/blob/develop/whitepapers/sensitivities/mean/mean.pdf

Specifically, Section 2.1 covers the add/remove version of the sensitivity of the mean. In this case, if you add or remove a person, the quantity n (the number of people in the database) cannot be used in the sensitivity calculation, because n itself is not fixed across neighboring datasets. The bug in the proof is that it considers only neighbors of the current dataset (from which it picks out n), while it should be considering all possible pairs of datasets that are neighbors of each other.
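To make the objection concrete, here is a small Python check (illustrative only; the clipping bounds [L, U] = [0, 1] are an assumption). The worst-case change from removing one record from a dataset of size n is (U - L)/n, so a bound that plugs in the n of one particular dataset fails for smaller neighboring pairs:

```python
# Illustration (assumed clipping bounds, not library code): under
# add/remove neighbors, the worst-case change in the mean from removing
# one record from a dataset of size n is (U - L) / n.  Since n is not
# fixed across neighboring pairs, a bound computed from one dataset's n
# does not hold for all pairs.
L_, U_ = 0.0, 1.0   # assumed clipping bounds

def mean(xs):
    return sum(xs) / len(xs)

for n in [2, 3, 10, 100]:
    xs = [U_] * (n - 1) + [L_]     # worst case: n-1 records at U, one at L
    neighbor = [U_] * (n - 1)      # the record at L removed
    diff = abs(mean(xs) - mean(neighbor))
    print(n, diff)                 # diff is (U - L) / n, largest for small n
```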

A secondary problem is that the mean of an empty dataset is undefined, yet the empty dataset still has to be factored into the computation of sensitivity.

Typically, for unbounded differential privacy (add/remove a person version), the mean is computed by adding noise to the numerator, adding noise to the denominator, and then dividing. In practice, it is best to return not just the noisy mean, but also the noisy numerator and noisy denominator since the user already "paid" the privacy cost for them.
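This recipe can be sketched in a few lines of Python (illustrative only; the function name, the 50/50 budget split, and the clipping bounds are assumptions, not the smartnoise API):

```python
# Sketch of the add/remove-DP mean described above: privatize the
# numerator and denominator separately, then divide.
import numpy as np

rng = np.random.default_rng(0)

def dp_mean_unbounded(xs, epsilon, lower, upper):
    xs = np.clip(np.asarray(xs, dtype=float), lower, upper)
    # assumed budget split between the two queries
    eps_sum, eps_count = epsilon / 2, epsilon / 2
    # add/remove sensitivity: max(|lower|, |upper|) for the clipped sum, 1 for the count
    noisy_sum = xs.sum() + rng.laplace(scale=max(abs(lower), abs(upper)) / eps_sum)
    noisy_count = len(xs) + rng.laplace(scale=1.0 / eps_count)
    # return all three: the privacy cost of numerator and denominator is already paid
    return noisy_sum / max(noisy_count, 1.0), noisy_sum, noisy_count
```

Returning the noisy sum and count alongside the quotient lets the analyst reuse the components without spending additional budget.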

raprasad commented 3 years ago

@dkifer, thanks for the details

Shoeboxam commented 3 years ago

@dkifer At this point, I think the proof is in need of a clarification, not a bug fix. I agree that if N is unknown, it would be completely inappropriate to make use of it in the proof.

From the theory perspective, this is a restricted-sensitivity proof, in the sense that it only applies when N is known. There are two cases where N can be known: the first is when N is public given the context of the data, and the second is when the dataset is preprocessed to be of a user-provided length N (either through sampling without amplification, or through imputation). This constraint restricts the set of neighboring datasets.
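Under that known-N restriction the whitepaper's bound does hold. A brute-force Python check (illustrative; the bounds and the small N are assumptions) confirms that change-one neighbors of fixed size N move the mean by at most (U - L)/N:

```python
# Brute-force check of the restricted (known-N) sensitivity: for datasets
# of fixed size N with records in [L, U] that differ in one record, the
# mean moves by at most (U - L) / N.
import itertools

L_, U_, N = 0.0, 1.0, 4
grid = [L_, 0.5, U_]
worst = 0.0
for xs in itertools.product(grid, repeat=N):
    for i in range(N):
        for z in grid:               # swap record i for value z
            ys = list(xs)
            ys[i] = z
            worst = max(worst, abs(sum(xs) / N - sum(ys) / N))
print(worst, (U_ - L_) / N)          # both are 0.25
```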

From the library perspective, this preprocessing is handled by a 1-stable "resize" transform that handles both of the above cases and establishes a static property on the data. The dp-mean that depends on this proof also depends on this static property to ensure that N is known and is non-zero. The library rejects the analysis if this property is not satisfied.
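For readers unfamiliar with the transform, here is a minimal Python sketch of what a 1-stable resize could look like (an illustration of the idea, not the library's actual implementation; the uniform imputer default is an assumption):

```python
# Illustrative 1-stable resize: force the data to have exactly n rows,
# by truncating or imputing.
import random

def resize(xs, n, imputer=lambda: random.uniform(0.0, 1.0)):
    # adding or removing one input record changes the output by at most
    # one record, so the transform has stability 1
    if len(xs) >= n:
        return xs[:n]
    return xs + [imputer() for _ in range(n - len(xs))]
```

After this transform the dataset length is a static property, so the downstream dp-mean can safely divide by a known, non-zero N.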

The library has two implementations of the mean. You can switch to the mean you describe by passing the argument implementation="plug-in" to the sn.dp_mean function. If the plug-in estimates themselves are desired, then call the sum and count estimators separately and postprocess:

`mean = sum / sn.row_max(count, 1)`
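In plain Python terms (hypothetical released values, not the sn API), the clamp in the denominator is pure postprocessing:

```python
# Clamping the noisy count below at 1 guards against a zero or negative
# denominator, and as postprocessing it costs no extra privacy budget.
noisy_sum, noisy_count = 41.7, -0.3   # hypothetical released values
mean = noisy_sum / max(noisy_count, 1.0)
print(mean)   # 41.7
```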

I'm really grateful for your interest in these proofs and would be happy to look into this further if you think we're still in error. At minimum I'll open a PR with clarifications on the whitepapers.

EDIT: The remove case when N = 1 needs a closer look. There's also potentially a tighter formula on symmetric distance/bounded DP that I could look into backporting from the OpenDP library.

dkifer commented 3 years ago

Adding this kind of context to the documentation would be great. Looking forward to seeing the update.