opendp / dpcreator

Web application that makes data releases that satisfy differential privacy using the OpenDP Library
MIT License

Histograms - add accuracy measures #495

Closed raprasad closed 2 years ago

raprasad commented 2 years ago

Note: There are two types of accuracy re: histograms

  1. "universal" accuracy -- e.g. one accuracy measure for the entire histogram (not yet available)
  2. accuracy for each cell/category (available)

This issue is to implement (2) above.

Basic flow

files to change

Shoeboxam commented 2 years ago

The respective statistical terms are:

  1. Simultaneous confidence intervals or family-wise error rate
  2. Individual confidence intervals

For some more intuition on the difference, first consider a histogram with 99 categories, admitting 100 bins. If alpha = 0.05, then you'd expect 5 of the noisy counts to fall outside their CIs! Equivalently, with different wording: if I'm 95% confident in the intervals and test the intervals 100 times, then on average 5 of the tests will fail.
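A quick simulation makes this concrete. The sketch below uses a hypothetical noise scale and a local stand-in `lap_scale_to_accuracy` (implementing the standard Laplace tail bound, not necessarily the library's signature) to check that roughly an alpha fraction of noisy counts land outside their individual CIs:

```python
import math
import random

def lap_scale_to_accuracy(scale, alpha):
    # Standard Laplace tail bound: P(|noise| > a) = exp(-a / scale),
    # so the (1 - alpha)-level accuracy is a = scale * ln(1 / alpha).
    return scale * math.log(1.0 / alpha)

def laplace_sample(scale, rng):
    # Inverse-CDF sampling of Laplace noise (illustrative only, not DP-grade).
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

rng = random.Random(0)
scale, alpha, m, trials = 2.0, 0.05, 100, 1000  # hypothetical parameters

accuracy = lap_scale_to_accuracy(scale, alpha)
misses = sum(
    abs(laplace_sample(scale, rng)) > accuracy
    for _ in range(m * trials)
)
print(misses / (m * trials))  # empirical miss rate, close to alpha
```

Over many simulated histograms, the empirical fraction of counts falling outside their individual CIs concentrates near alpha, matching the "5 failures per 100 tests" intuition above.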

If I want all of these dimensions to differ by no more than the accuracy simultaneously, then I want a simultaneous CI. For a simultaneous CI, we widen the intervals enough that I am at least 95% confident that all 100 released values are within their CIs. That is, if I were to release this histogram 100 times, I would expect only 5 of the histograms to contain a count that differs from the true count by more than the (simultaneous) accuracy.

There's a very straightforward way to derive loose simultaneous CIs with a Bonferroni correction. Let's assume again there are m = 100 bins, so we want to simultaneously pass m individual tests. One can simply divide alpha by m and pass this new, smaller alpha into the laplacian_scale_to_accuracy function. The resulting accuracy value is larger than the individual accuracy. We can then say that all counts differ by at most this accuracy with (1 - alpha) * 100% confidence.
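The arithmetic of the correction can be sketched directly. Here `lap_scale_to_accuracy` is a local stand-in for the library function (implementing the standard Laplace tail bound), and the noise scale is a made-up value:

```python
import math

def lap_scale_to_accuracy(scale, alpha):
    # Standard Laplace tail bound: P(|noise| > a) = exp(-a / scale),
    # so the (1 - alpha)-level accuracy is a = scale * ln(1 / alpha).
    return scale * math.log(1.0 / alpha)

m = 100        # number of histogram bins
alpha = 0.05   # target overall confidence level
scale = 2.0    # Laplace noise scale (hypothetical)

individual = lap_scale_to_accuracy(scale, alpha)        # per-bin CI half-width
simultaneous = lap_scale_to_accuracy(scale, alpha / m)  # Bonferroni-corrected

print(individual, simultaneous)
```

Note the correction only widens each interval additively by scale * ln(m), since scale * ln(m / alpha) = scale * ln(1 / alpha) + scale * ln(m).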

There are ways of deriving simultaneous CIs that are tighter. You can think of the acceptance region derived by this Bonferroni method as an acceptance hypercube, whereas a hypersphere can be tighter, giving smaller accuracies at the same alpha (this is a statement about the L2 distance between the released and true counts). Deriving this is pretty straightforward for Gaussian noise, since Gaussians compose nicely, and a bit more annoying for Laplacians. I wrote a bit about this here. For now, if you have an immediate need for simultaneous CIs, you could simply adjust alpha with a Bonferroni correction (alpha / m). It is not clear to me that we need these simultaneous CIs in the library at this point.

Shoeboxam commented 2 years ago

As to the data structure in this issue, it is not necessary to release a separate accuracy for each individual bin/category, because the noise scale is identical across all bins/categories. Assuming you are using L1 (Laplace) noise (which you are), you can pass this noise scale into laplacian_scale_to_accuracy to get the accuracy of each cell. This only needs to be done once.
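Concretely, that means computing the accuracy once from the shared noise scale and attaching the same value to every cell of the release. The release structure and parameter values below are hypothetical, and `lap_scale_to_accuracy` is again a local stand-in implementing the Laplace tail bound:

```python
import math

def lap_scale_to_accuracy(scale, alpha):
    # (1 - alpha)-level accuracy from the Laplace tail bound.
    return scale * math.log(1.0 / alpha)

noise_scale = 2.0  # shared by every bin (hypothetical value)
alpha = 0.05
accuracy = lap_scale_to_accuracy(noise_scale, alpha)  # computed once

noisy_counts = {"red": 41, "green": 17, "blue": 23}   # hypothetical release
release = [
    {"category": cat, "count": cnt, "accuracy": accuracy}
    for cat, cnt in noisy_counts.items()
]
print(release)
```

Every cell carries the same accuracy value, so there is no per-bin computation to loop over.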