Change how last bin counts are derived in histogram code

privacytoolsproject / PSI-Library

R library of differentially private algorithms for exploratory data analysis

Currently, the histogram statistic derives the count for one bin by summing the counts of all the other bins and subtracting that sum from N.

https://github.com/privacytoolsproject/PSI-Library/blob/3b3e0cd958e423d576ce6fb198652d24d5fbeaf2/R/statistic-histogram.R#L136-L150

There are a few issues with this:

That bin has a different standard deviation from the other bins, and the release does not make this clear.
The function is called within histogram.formatRelease, whose name implies that it only formats the output rather than actively changing it, which is misleading.
The name compose also feels a bit misleading.
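
To make the first point concrete, here is a minimal sketch. It assumes that each of the other k - 1 bins is released with independent Laplace noise of scale b and that N is treated as known; the noise model, the helper names, and the example numbers are all assumptions for illustration, not the library's code.

```r
# Hypothetical helper (not part of PSI-Library): derive the last bin
# from N and the noisy counts of the other bins.
deriveLastBin <- function(noisyOthers, n) {
  n - sum(noisyOthers)
}

# Standard deviation of a bin released directly with Laplace(scale = b) noise.
sdReleasedBin <- function(b) sqrt(2) * b

# Standard deviation of the derived bin: it absorbs the independent noise
# of all k - 1 released bins, so their variances add.
sdDerivedBin <- function(b, k) sqrt(k - 1) * sqrt(2) * b

noisyOthers <- c(12.4, 30.9, 7.2)     # made-up noisy counts for 3 of 4 bins
lastBin <- deriveLastBin(noisyOthers, n = 60)
sdRatio <- sdDerivedBin(b = 2, k = 4) / sdReleasedBin(b = 2)
print(lastBin)   # 9.5
print(sdRatio)   # sqrt(3), about 1.73
```

Under these assumptions the derived bin is noisier than the others by a factor of sqrt(k - 1), which grows with the number of bins.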

Instead, we should just normalize all of the bins so that they sum to N.
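
A rough sketch of what that normalization could look like, assuming every bin has already been released with noise and N is treated as known. The function name normalizeCounts and the example values are hypothetical.

```r
# Hypothetical helper: rescale noisy counts so that they sum to n.
normalizeCounts <- function(noisyCounts, n) {
  total <- sum(noisyCounts)
  if (total <= 0) {
    stop("noisy counts must have a positive sum to be rescaled")
  }
  noisyCounts * n / total
}

noisy <- c(12.4, 30.9, 7.2, 11.5)     # made-up noisy counts, sum is 62
normalized <- normalizeCounts(noisy, n = 60)
print(sum(normalized))                # 60, up to floating-point error
```

Because this only post-processes values that were already released, it should not affect the privacy guarantee as long as N is known, but it does change the error distribution of every bin, so any accuracy statement would need to be revisited.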

Follow-up questions:

How should we deal with negative counts in the normalization? If we truncate them to 0, the statement about histogram accuracy is no longer correct; on the other hand, negative counts are somewhat meaningless in this context.
In the old version, the counts were rounded so that they were all integers. If we round after normalizing, the counts will no longer sum to N. We could resolve this in a couple of different ways (e.g., find the difference between the desired total and the actual total, and distribute that difference randomly across the counts), but this would change the accuracy guarantee, and I'm not sure what is best. I vaguely remember watching a presentation about the census work that had some fancy ways of getting around this, since they have lots of counts that have to compose in nice ways. I can look into this, but want to check in first.
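
For discussion, here is one possible way to combine both follow-up points: clamp negative counts to 0, rescale to N, then round so that the integers still sum to N. Note that this sketch swaps the random distribution mentioned above for the deterministic largest-remainder method, purely because it is easier to reason about. The function name roundToTotal and the example values are hypothetical, n is assumed to be a whole number, and neither the clamping nor the rounding preserves the original accuracy guarantee.

```r
# Hypothetical helper: clamp negatives, rescale to n, then round with the
# largest-remainder method so that the integer counts still sum to n.
roundToTotal <- function(noisyCounts, n) {
  clamped <- pmax(noisyCounts, 0)
  if (sum(clamped) <= 0) {
    stop("all counts are non-positive after clamping")
  }
  scaled <- clamped * n / sum(clamped)
  floors <- floor(scaled)
  shortfall <- n - sum(floors)          # units still to hand out
  # Give one extra unit to each of the bins with the largest fractional parts.
  ranked <- order(scaled - floors, decreasing = TRUE)
  bump <- ranked[seq_len(shortfall)]
  floors[bump] <- floors[bump] + 1
  floors
}

rounded <- roundToTotal(c(-1.3, 20.6, 15.2, 26.4), n = 60)
print(rounded)        # 0 20 15 25
print(sum(rounded))   # 60
```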

Change how last bin counts are derived in histogram code #56