opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License
255 stars 70 forks source link

Implement Nullable and Fixed Impute #418

Closed joshua-oss closed 3 years ago

joshua-oss commented 3 years ago

By default, SQL columns can contain NULL values, representing missing values. When values are required, It is common for schemas to specify non-nullable. If we know which columns are not nullable, we can reduce privacy cost and improve consistency. For example, in a query like SELECT COUNT(*), AVG(age), AVG(income), AVG(educ), all 3 averages would be guaranteed to have the same denominator, which would be the same as COUNT(*), so the same noisy value could be used for all.

We could achieve the same gains by, allowing NULL to be automatically imputed in pre-processing, we could guarantee no missing values. This is easy for fixed imputation, such as imputing missing values as 0. However, imputation that relies on the underlying data (e.g. mean, median, regression) could leak privacy. Calculating a differentially private imputed value will typically cost more than benefit. For example, in the query above, spending budget to calculate the nullable mean (to determine the imputed value), then calculating the average over the full non-null data, would be wasteful, since the mean won't change. For this reason, it seems best to push that sort of imputation to the data curators.

Implementation would add two attributes to column metadata:

joshua-oss commented 3 years ago

Implemented in #435