Closed by globusharris 5 years ago
Follow-up questions:
How should we deal with negative counts in the normalization? If we truncate them to 0, the statement about histogram accuracy is no longer correct; on the other hand, negative counts are somewhat meaningless in this context.
In the old version, the counts were rounded so that they were all integer values. If we round after normalizing, the counts will no longer sum to N. We could resolve this in a couple of different ways (e.g. find the difference between the desired total and the actual total, and distribute that difference randomly across the counts), but this would change the accuracy guarantee, and I'm not sure what is best. I vaguely remember watching a presentation about the census work that had some fancy ways of getting around this, since they have lots of counts that have to compose in nice ways. I can look into this, but want to check in first.
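To make the rounding problem concrete, here is a minimal Python sketch (the library itself is R, and this is not its code). Instead of distributing the difference randomly, it uses largest-remainder rounding, a deterministic alternative I'm swapping in purely for illustration. The function name `round_preserving_total` is hypothetical, and the effect on the accuracy guarantee is not analyzed here.

```python
import math

def round_preserving_total(counts, total):
    """Round counts to integers that still sum to `total`.

    Hypothetical helper, not part of PSI-Library. Assumes `counts`
    already sum (approximately) to the integer `total`.
    """
    floors = [math.floor(c) for c in counts]
    shortfall = total - sum(floors)
    if not 0 <= shortfall <= len(counts):
        raise ValueError("counts must already sum approximately to total")
    # Hand the missing units to the bins with the largest fractional parts.
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] - floors[i],
                   reverse=True)
    for i in order[:shortfall]:
        floors[i] += 1
    return floors

normalized = [40.4, 30.3, 29.3]                  # sums to 100 before rounding
naive = [round(c) for c in normalized]           # plain rounding loses a unit
fixed = round_preserving_total(normalized, 100)  # total is preserved
```

In this example plain rounding gives a total of 99, while the adjusted counts sum to 100. Whether a deterministic or a randomized adjustment is preferable for the accuracy statement is exactly the open question above.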
Currently, the histogram statistic computes one bin of the histogram by summing the counts in all other bins and subtracting that sum from N.
https://github.com/privacytoolsproject/PSI-Library/blob/3b3e0cd958e423d576ce6fb198652d24d5fbeaf2/R/statistic-histogram.R#L136-L150
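For readers who don't want to open the link, the behaviour described above can be sketched roughly as follows. This is a Python illustration only, not a translation of the R code in statistic-histogram.R; `fill_last_bin` and its arguments are hypothetical names, and the noisy counts are made-up numbers.

```python
def fill_last_bin(noisy_other_bins, n):
    """Derive the final bin as n minus the sum of the other (noisy) bins.

    Hypothetical sketch of the approach described in this issue,
    not the library's actual implementation.
    """
    last = n - sum(noisy_other_bins)
    return list(noisy_other_bins) + [last]

# Made-up example: true counts are [40, 35, 25] with N = 100,
# and the first two bins were released with noise +2.5 and +1.5.
released = fill_last_bin([42.5, 36.5], 100)
last_bin_error = released[-1] - 25
```

Here the derived bin ends up at 21.0, so it carries an error of -4.0, which is the negated sum of the noise in the other two bins. The total is exact, but only because one bin absorbs everyone else's error.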
There are a few issues with this:
- The adjustment happens in histogram.formatRelease, whose name implies that it only formats the output rather than actively changing it, which is misleading.
- compose also feels a bit misleading.

Instead, we should just normalize all of the bins so that they sum to N.
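As a rough sketch of what normalizing all of the bins could look like, here is a Python illustration that assumes "normalize" means multiplicative rescaling (an additive shift would be another reading). The library is R, so this is not its code, and `normalize_to_total` is a hypothetical name.

```python
def normalize_to_total(noisy_counts, n):
    """Rescale every noisy bin by the same factor so the bins sum to n.

    Hypothetical sketch; assumes multiplicative rescaling is the
    intended normalization. Negative bins stay negative.
    """
    total = sum(noisy_counts)
    if total <= 0:
        raise ValueError("cannot rescale: noisy counts sum to a non-positive value")
    scale = n / total
    return [c * scale for c in noisy_counts]

# Made-up noisy counts that sum to 110 while N = 100.
released = normalize_to_total([44.0, 33.0, 33.0], 100)
```

The released bins come out at approximately 40, 30 and 30, so the correction is spread over every bin rather than being loaded onto a single one. The sketch deliberately leaves negative bins alone and does no rounding.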