raprasad closed this issue 2 years ago.
The respective statistical terms are *individual* confidence intervals and *simultaneous* confidence intervals.
For some more intuition on the difference, first consider a histogram with 99 categories, admitting 100 bins. If alpha = 0.05, then you'd expect 5 of the noisy counts to fall outside their CIs! Equivalently, with different wording: if I'm 95% confident in each interval, and test the intervals 100 times, then on average 5 of the tests will fail.
If I want all of these counts to differ from their true values by no more than the accuracy simultaneously, then I want a simultaneous CI. For a simultaneous CI, we widen the intervals enough that I am at least 95% confident that all 100 released values are within their CIs. That is, if I were to release this histogram 100 times, I would expect only 5 of the histograms to contain a count that differs from the true count by more than the (simultaneous) accuracy.
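A quick simulation illustrates the individual-CI failure rate (a sketch only; the noise scale and trial counts here are arbitrary, and the Laplace sampler and accuracy formula are written out by hand rather than taken from the library):

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling of Laplace(0, scale)
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def accuracy(scale, alpha):
    # Half-width a such that P(|Laplace(scale) noise| > a) = alpha
    return scale * math.log(1 / alpha)

rng = random.Random(0)
scale, alpha, bins, trials = 2.0, 0.05, 100, 1000
a = accuracy(scale, alpha)

# Count how often a noisy count lands outside its individual CI.
misses = sum(
    abs(laplace_noise(scale, rng)) > a
    for _ in range(trials)
    for _bin in range(bins)
)
print(misses / (trials * bins))  # close to alpha = 0.05
```

With 100 bins per histogram, nearly every simulated histogram has at least one bin outside its individual CI, which is exactly why a simultaneous CI must be wider.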
There's a very straightforward way to derive loose simultaneous CIs with a Bonferroni correction. Let's assume again there are m = 100 bins, so we want to simultaneously pass m individual tests. One can simply divide alpha by m, and pass this new, smaller alpha into the laplacian_scale_to_accuracy function. The resulting accuracy value is larger than the individual accuracy, and we can say that all counts differ by at most this accuracy with (1 - alpha) * 100% confidence.
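As a sketch (assuming `laplacian_scale_to_accuracy(scale, alpha)` computes `scale * ln(1/alpha)`, the standard Laplace tail bound; the scale value below is made up):

```python
import math

def laplacian_scale_to_accuracy(scale, alpha):
    # Accuracy a such that P(|Laplace(scale) noise| > a) = alpha
    return scale * math.log(1 / alpha)

m = 100        # number of bins
alpha = 0.05   # desired overall (simultaneous) alpha
scale = 2.0    # hypothetical Laplace noise scale

individual = laplacian_scale_to_accuracy(scale, alpha)        # per-bin CI half-width
simultaneous = laplacian_scale_to_accuracy(scale, alpha / m)  # Bonferroni-corrected
# simultaneous > individual: the corrected interval is wider by scale * ln(m)
```

Note that the correction only widens each interval additively by `scale * ln(m)`, so the cost of simultaneity grows slowly with the number of bins.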
There are ways of deriving simultaneous CIs that are tighter. You can think of the acceptance region derived by this Bonferroni method as an acceptance hypercube, whereas a hypersphere can be tighter, giving smaller accuracies at the same alpha (which is a statement about the L2 distance between the released and true counts). Deriving this is pretty straightforward for Gaussian noise, since Gaussians compose nicely, and a bit more involved for Laplace noise. I wrote a bit about this here. For now, if you have an immediate need for simultaneous CIs, you could simply adjust alpha with a Bonferroni correction (alpha/m). It is not clear to me whether we need these simultaneous CIs in the library at this point.
As to the data structure in this issue, it is not necessary to release separate accuracies for each individual bin/category, because the noise scale is identical for all bins/categories. Assuming you are using Laplace (L1) noise (which you are), you can pass this noise scale into laplacian_scale_to_accuracy to get the accuracy of every cell. This only needs to be done once.
Note: There are two types of accuracy re: histograms
This issue is to implement (2) ^ above.
Basic flow
`[(category, accuracy), (category, accuracy), etc.]`
`{category: accuracy, category: accuracy, etc.}`
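The basic flow above might look like this (a sketch only; `laplacian_scale_to_accuracy` is assumed to compute `scale * ln(1/alpha)`, and the categories and scale are hypothetical):

```python
import math

def laplacian_scale_to_accuracy(scale, alpha):
    # Accuracy a such that P(|Laplace(scale) noise| > a) = alpha
    return scale * math.log(1 / alpha)

categories = ["A", "B", "C"]   # hypothetical histogram categories
scale = 2.0                    # one noise scale shared by all bins
alpha = 0.05

# Compute the accuracy once, then attach the same value to every category.
accuracy = laplacian_scale_to_accuracy(scale, alpha)
per_category = {category: accuracy for category in categories}
```

Since the scale is shared, the dict form simply repeats one number; consumers that know this could also accept a single scalar accuracy.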
files to change