By default, SQL columns can contain NULL values, representing missing values. When values are required, it is common for schemas to mark columns as non-nullable. If we know which columns are not nullable, we can reduce privacy cost and improve consistency. For example, in a query like `SELECT COUNT(*), AVG(age), AVG(income), AVG(educ)`, all three averages would be guaranteed to have the same denominator, which would be the same as `COUNT(*)`, so the same noisy value could be used for all.
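As a minimal sketch of the consistency gain: when every column is non-nullable, one noisy count can serve as the shared denominator for all the averages. The rows, column names, and noise helper below are illustrative, not any particular library's API:

```python
import random

def laplace_noise(scale: float) -> float:
    # Laplace(0, scale) as the difference of two exponentials (illustrative).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

# Hypothetical rows where age, income, and educ are all declared non-nullable.
rows = [
    {"age": 34, "income": 52_000, "educ": 16},
    {"age": 29, "income": 48_000, "educ": 14},
    {"age": 41, "income": 61_000, "educ": 18},
]

epsilon = 1.0
# One noisy COUNT(*) is spent once and re-used as the denominator for
# every AVG, since non-nullable columns guarantee identical true counts.
noisy_count = len(rows) + laplace_noise(1.0 / epsilon)

noisy_avgs = {}
for col in ("age", "income", "educ"):
    noisy_sum = sum(r[col] for r in rows)  # a real system would add noise here too
    noisy_avgs[col] = noisy_sum / noisy_count
```

Without the non-nullable guarantee, each average could have a different true denominator, forcing a separate noisy count (and extra privacy budget) per column.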
Alternatively, we could achieve similar gains by allowing NULL values to be automatically imputed in pre-processing, guaranteeing no missing values. This is easy for fixed imputation, such as replacing missing values with 0. However, imputation that relies on the underlying data (e.g. mean, median, regression) could leak privacy, and calculating a differentially private imputed value will typically cost more than it benefits. For example, in the query above, spending budget to calculate the mean of the non-NULL values (to determine the imputed value), then calculating the average over the fully imputed data, would be wasteful, since the mean won't change. For this reason, it seems best to push that sort of imputation to the data curators.
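A minimal sketch of privacy-safe fixed imputation, using Python dicts as stand-in rows (the column names and per-column `MISSING` defaults are illustrative). The key point is that the imputed values are constants chosen without looking at the data:

```python
# Fixed, data-independent defaults per column; safe because they do not
# depend on the underlying data, so no privacy budget is spent.
MISSING = {"age": 0, "income": 0.0}

def impute_fixed(row: dict) -> dict:
    # Replace NULL (None) with the column's fixed default.
    return {col: (MISSING[col] if val is None else val)
            for col, val in row.items()}

rows = [
    {"age": 34, "income": None},
    {"age": None, "income": 48_000},
]
clean = [impute_fixed(r) for r in rows]
# clean == [{"age": 34, "income": 0.0}, {"age": 0, "income": 48000}]
```

Data-dependent imputation (mean, median, regression) would have to be computed with its own differentially private mechanism, which is exactly the extra cost the text argues against.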
Implementation would add two attributes to column metadata:
- `nullable`: boolean, default `True`. If set to `False`, noisy counts would be re-used wherever the true counts are identical.
- `missing`: can be assigned a value of the same type as the column, for example `missing: 0.0`.
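The two proposed attributes could be represented in column metadata along these lines (a sketch only; the dict layout and the `can_share_count` helper are hypothetical, not an existing API):

```python
# Hypothetical column metadata carrying the two proposed attributes.
column_metadata = {
    "age":    {"type": "int", "nullable": False},  # never NULL
    "income": {"type": "float", "nullable": True,
               "missing": 0.0},  # NULLs imputed to this fixed value
}

def can_share_count(columns, metadata) -> bool:
    """True if every listed column is non-nullable, so a single noisy
    COUNT(*) can serve as the denominator for all their averages."""
    return all(not metadata[c].get("nullable", True) for c in columns)
```

For example, `can_share_count(["age"], column_metadata)` holds, while including `income` breaks the guarantee and would force a separate noisy count.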