privacytoolsproject / PSI-Library

R library of differentially private algorithms for exploratory data analysis

Add NA bin to histogram when user enters a list of histogram bins that are a subset of all variable levels (tabled until library restructured) #91

Open MeganFantes opened 5 years ago

MeganFantes commented 5 years ago

Right now there are two cases when making a histogram for a categorical variable:

1) The user enters a list of bins, and the Laplace mechanism is used
2) The user does NOT enter a list of bins, and the stability mechanism is used
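For readers unfamiliar with the second case, here is a rough sketch of one textbook form of a stability-based histogram: noise is added only to levels that actually occur, and any noisy count below a threshold is dropped. The function names, the Laplace scale of 2 / epsilon, and the threshold 1 + 2 * log(2 / delta) / epsilon are assumptions for illustration and may not match what the library does.

```r
# Hypothetical helper (not a PSI-Library function): draw Laplace(0, scale)
# noise by inverse-CDF sampling from a uniform on (-0.5, 0.5).
rLaplace <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Hypothetical sketch of a stability-based histogram.
stabilityHistogram <- function(x, epsilon, delta) {
  counts <- table(as.character(x))   # only levels observed in the data
  noisy <- as.vector(counts) + rLaplace(length(counts), scale = 2 / epsilon)
  names(noisy) <- names(counts)
  # Assumed threshold; rare levels are suppressed so their presence is hidden.
  threshold <- 1 + 2 * log(2 / delta) / epsilon
  noisy[noisy >= threshold]
}

set.seed(1)
x <- c(rep("a", 1000), "b")
released <- stabilityHistogram(x, epsilon = 1, delta = 1e-6)
```

Because the set of released bins depends on the data, this approach needs no user-supplied bin list, which is why it is paired with the "bins not entered" case.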

We want to implement a third case:

3) The user enters a list of bins, but the list is a subset of the full set of levels the variable takes. In that case we add an NA bin to the list of bins, set all levels that were not entered in the list of bins to NA, and then use the stability mechanism.
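The recoding step for this third case could look roughly like the sketch below. `recodeToNaBin` is a hypothetical name, not an existing function in utilities-histogram.R, and the example data are made up.

```r
# Hypothetical helper: map every level outside the user-supplied bins to NA.
recodeToNaBin <- function(x, bins) {
  x <- as.character(x)
  x[!(x %in% bins)] <- NA      # levels the user did not list become NA
  factor(x, levels = bins)
}

x <- c("a", "b", "c", "a", "d", "b")
recoded <- recodeToNaBin(x, bins = c("a", "b"))

# useNA = "always" keeps an NA bin even when nothing was recoded,
# so the set of bins does not depend on the data.
counts <- table(recoded, useNA = "always")
```

Here "c" and "d" both land in the NA bin, so the table has three cells: "a", "b", and NA.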

In implementing this third case, we will use the existing histogramCategoricalBins function in utilities-histogram.R

MeganFantes commented 5 years ago

Updated idea:

Do not implement a third case, instead change the first case:

1) Bins entered: use the Laplace mechanism and check the impute parameter; always add an NA bucket if impute = FALSE
2) Bins not entered: use the stability mechanism
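A minimal sketch of what the revised first case might look like, to make the role of impute concrete. All names here are hypothetical and this is not the library's API. Two things are assumed: imputation is stood in for by uniform sampling over the entered bins, and the noise scale of 2 / epsilon assumes neighboring datasets differ by changing one record.

```r
# Hypothetical helper: Laplace(0, scale) noise via inverse-CDF sampling.
rLaplace <- function(n, scale) {
  u <- runif(n, min = -0.5, max = 0.5)
  -scale * sign(u) * log(1 - 2 * abs(u))
}

# Hypothetical sketch of case 1 with the impute check.
dpHistogramBins <- function(x, bins, epsilon, impute = FALSE) {
  x <- as.character(x)
  x[!(x %in% bins)] <- NA                  # unlisted levels become NA
  if (impute) {
    # Assumed imputation rule: replace NA uniformly at random from the bins.
    missing <- is.na(x)
    x[missing] <- sample(bins, sum(missing), replace = TRUE)
    counts <- table(factor(x, levels = bins))
  } else {
    # No imputation: always release an NA bucket alongside the entered bins.
    counts <- table(factor(x, levels = bins), useNA = "always")
  }
  noisy <- as.vector(counts) + rLaplace(length(counts), scale = 2 / epsilon)
  names(noisy) <- names(counts)
  noisy
}

set.seed(42)
x <- c("a", "b", "c", NA, "a")
withNaBin <- dpHistogramBins(x, bins = c("a", "b"), epsilon = 1, impute = FALSE)
imputed   <- dpHistogramBins(x, bins = c("a", "b"), epsilon = 1, impute = TRUE)
```

With impute = FALSE the result has one cell per entered bin plus the NA bucket; with impute = TRUE it has only the entered bins.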

Need to update histogram vignette to make sure impute is used in all contexts

MeganFantes commented 5 years ago

Ira and I discussed this at length, and we decided this issue should be tabled for now.

Given the way the library is structured now, where there are export() statements in the statistics to call the mechanisms, there is no logical way to set a local attribute in a subclass and then check for its existence.

We plan to do a major restructuring of the library so that the mechanisms and statistics are completely separate entities, at which point it will be feasible to set impute as an attribute of only the histogram statistic.

When the library is restructured, we can revisit the issue of conditioning the call to fillMissing() on impute for the histogram statistic.