opendp / dpcreator

Web application that makes data releases that satisfy differential privacy using the OpenDP Library
MIT License

Histograms - add accuracy measures #495

Closed raprasad closed 2 years ago

raprasad commented 2 years ago

Note: There are two types of accuracy re: histograms

  1. "universal" accuracy -- e.g. one accuracy measure for the entire histogram (not yet available)
  2. accuracy for each cell/category (available)

This issue is to implement (2) above.

Basic flow

files to change

Shoeboxam commented 2 years ago

The respective statistical terms are:

  1. Simultaneous confidence intervals or family-wise error rate
  2. Individual confidence intervals

For some more intuition on the difference, first consider a histogram with 99 categories, admitting 100 bins. If alpha = 0.05, then you'd expect 5 of the noisy counts to fall outside their CIs! Equivalently, with different wording: if I'm 95% confident in the intervals and test the intervals 100 times, then on average 5 of the tests will fail.
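A quick simulation makes this concrete. The sketch below uses a hypothetical noise scale and a local stand-in `lap_scale_to_accuracy` (implementing the standard Laplace tail bound, not necessarily the library's signature) to check that roughly an alpha fraction of noisy counts land outside their individual CIs:

```python
import math
import random

def lap_scale_to_accuracy(scale, alpha):
    # Standard Laplace tail bound: P(|noise| > a) = exp(-a / scale),
    # so the (1 - alpha)-level accuracy is a = scale * ln(1 / alpha).
    return scale * math.log(1.0 / alpha)

def laplace_sample(scale, rng):
    # Inverse-CDF sampling of Laplace noise (illustrative only, not DP-grade).
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

rng = random.Random(0)
scale, alpha, m, trials = 2.0, 0.05, 100, 1000  # hypothetical parameters

accuracy = lap_scale_to_accuracy(scale, alpha)
misses = sum(
    abs(laplace_sample(scale, rng)) > accuracy
    for _ in range(m * trials)
)
print(misses / (m * trials))  # empirical miss rate, close to alpha
```

Over many simulated histograms, the empirical fraction of counts falling outside their individual CIs concentrates near alpha, matching the "5 failures per 100 tests" intuition above.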

If I want all of these dimensions to differ by no more than the accuracy simultaneously, then I want a simultaneous CI. For a simultaneous CI, we widen the intervals enough that I am at least 95% confident that all 100 released values are within their CIs. That is, if I were to release this histogram 100 times, I would expect only 5 of the histograms to contain a count that differs from the true count by more than the (simultaneous) accuracy.

There's a very straightforward way to derive loose simultaneous CIs with a Bonferroni correction. Let's assume again there are m = 100 bins, so we want to simultaneously pass m individual tests. One can simply divide alpha by m and pass this new, smaller alpha into the laplacian_scale_to_accuracy function. The resulting accuracy value is larger than the individual accuracy. We can then say that all counts differ by at most this accuracy with (1 - alpha) * 100% confidence.
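The arithmetic of the correction can be sketched directly. Here `lap_scale_to_accuracy` is a local stand-in for the library function (implementing the standard Laplace tail bound), and the noise scale is a made-up value:

```python
import math

def lap_scale_to_accuracy(scale, alpha):
    # Standard Laplace tail bound: P(|noise| > a) = exp(-a / scale),
    # so the (1 - alpha)-level accuracy is a = scale * ln(1 / alpha).
    return scale * math.log(1.0 / alpha)

m = 100        # number of histogram bins
alpha = 0.05   # target overall confidence level
scale = 2.0    # Laplace noise scale (hypothetical)

individual = lap_scale_to_accuracy(scale, alpha)        # per-bin CI half-width
simultaneous = lap_scale_to_accuracy(scale, alpha / m)  # Bonferroni-corrected

print(individual, simultaneous)
```

Note the correction only widens each interval additively by scale * ln(m), since scale * ln(m / alpha) = scale * ln(1 / alpha) + scale * ln(m).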

There are ways of deriving simultaneous CIs that are tighter. You can think of the acceptance region derived by this Bonferroni method as an acceptance hypercube, whereas a hypersphere can be tighter, giving smaller accuracies at the same alpha (this is a statement about the L2 distance between the released and true counts). Deriving this is pretty straightforward for Gaussian noise, since Gaussians compose nicely, and a bit more annoying for Laplacians. I wrote a bit about this here. For now, if you have an immediate need for simultaneous CIs, you could simply adjust alpha with a Bonferroni correction (alpha / m). It is not clear to me that we need these simultaneous CIs in the library at this point.

Shoeboxam commented 2 years ago

As to the data structure in this issue, it is not necessary to release a separate accuracy for each individual bin/category, because the noise scale is identical across all bins/categories. Assuming you are using L1 (Laplace) noise (which you are), you can pass this noise scale into laplacian_scale_to_accuracy to get the accuracy of each cell. This only needs to be done once.
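Concretely, that means computing the accuracy once from the shared noise scale and attaching the same value to every cell of the release. The release structure and parameter values below are hypothetical, and `lap_scale_to_accuracy` is again a local stand-in implementing the Laplace tail bound:

```python
import math

def lap_scale_to_accuracy(scale, alpha):
    # (1 - alpha)-level accuracy from the Laplace tail bound.
    return scale * math.log(1.0 / alpha)

noise_scale = 2.0  # shared by every bin (hypothetical value)
alpha = 0.05
accuracy = lap_scale_to_accuracy(noise_scale, alpha)  # computed once

noisy_counts = {"red": 41, "green": 17, "blue": 23}   # hypothetical release
release = [
    {"category": cat, "count": cnt, "accuracy": accuracy}
    for cat, cnt in noisy_counts.items()
]
print(release)
```

Every cell carries the same accuracy value, so there is no per-bin computation to loop over.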