opendp / smartnoise-core

Differential privacy validator and runtime
MIT License

[Question] Generate differential private release of a dataset? #317

Closed sasvaritoni closed 3 years ago

sasvaritoni commented 3 years ago

Hi,

As far as I understand, OpenDP is mainly meant for statistical queries. I am wondering whether OpenDP could be used to generate a differentially private release of a dataset, that is, to transform the original dataset into an "anonymized" one. Is this planned for the future?

Thanks, Toni

Shoeboxam commented 3 years ago

Sorry for the delayed response. We've made adjustments to improve notifications for questions raised like this.

There are utilities for generating synthetic data in the SmartNoise SDK. Synthetic data generation has not yet been integrated into the core library. You can find documentation for using the synthetic data features in the SmartNoise SDK here.

tercer commented 3 years ago

There are three possible ways to generate publicly releasable datasets that are differentially private.

  1. Create synthetic data from a DP-generated model (e.g. the sufficient statistics, or a DP-GAN).
  2. Aggregate the raw data down into tables of aggregates that are themselves DP.
  3. Add DP noise to every individual cell value in the matrix, as in the local model (LDP).
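To make the third option concrete, here is a minimal sketch of the local model for a single binary column, using classic randomized response. It is plain Python and not SmartNoise or OpenDP code; the function names, the example data, and the choice of epsilon are all assumptions made for illustration.

```python
import math
import random


def randomized_response(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_mean(reports, epsilon):
    """Debias the mean of the noisy reports to estimate the true proportion of 1s."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[report] = (1 - p) + mean * (2p - 1), so invert that relation.
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Hypothetical data: 10,000 respondents, 30% of whom have the sensitive attribute.
rng = random.Random(0)
true_bits = [1] * 3000 + [0] * 7000
epsilon = 1.0  # assumed per-cell privacy budget

# Every cell is perturbed independently before release.
released = [randomized_response(b, epsilon, rng) for b in true_bits]
estimate = estimate_mean(released, epsilon)
print(f"estimated proportion: {estimate:.3f}")
```

Each released cell is individually noisy, which is why the local model tends to cost much more utility than releasing aggregates: only population-level estimates such as the debiased mean remain useful.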

Approach 2 is very doable in the present SmartNoise Core release, and in the forthcoming OpenDP library.

Approach 3 (LDP) is not presently in scope for SmartNoise/OpenDP, but could be if someone wants to run with it. The mechanisms are there in the library; we just haven't thought seriously about all the required utilities.

Approach 1 is possible, but the models presently supported (VCV matrices) are not high in utility. There is some additional DP-GAN code available under the SmartNoise umbrella, but it is not currently integrated with the SmartNoise Core library; it exists as a separate service/process. Integrating it is likely in scope, but we don't yet have a timeline for it.
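As a rough illustration of approaches 2 and 1, the sketch below releases a DP table of counts with the Laplace mechanism and then samples synthetic records from that noisy table. It is plain Python rather than the SmartNoise Core or OpenDP API, and it assumes a fixed, publicly known category list, add/remove-one-record neighbouring datasets (so the histogram has sensitivity 1), and made-up data. Floating-point Laplace sampling like this has known weaknesses, so treat it as a teaching sketch and use a vetted library for real releases.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_histogram(records, categories, epsilon, rng):
    """Approach 2: noisy counts over a fixed, data-independent category list."""
    counts = Counter(records)
    scale = 1.0 / epsilon  # assumes sensitivity 1 (add/remove one record)
    return {c: counts.get(c, 0) + laplace_noise(scale, rng) for c in categories}


def sample_synthetic(noisy_counts, n, rng):
    """Approach 1: draw synthetic records from the noisy histogram.

    Clamping and sampling only post-process the DP release, so they do not
    consume additional privacy budget.
    """
    cats = list(noisy_counts)
    weights = [max(noisy_counts[c], 0.0) for c in cats]
    if sum(weights) == 0:
        weights = [1.0] * len(cats)  # fall back to uniform if all counts clamp to 0
    return rng.choices(cats, weights=weights, k=n)


# Hypothetical categorical column with three publicly known categories.
rng = random.Random(0)
categories = ["a", "b", "c"]
records = ["a"] * 600 + ["b"] * 300 + ["c"] * 100

noisy = dp_histogram(records, categories, epsilon=1.0, rng=rng)
synthetic = sample_synthetic(noisy, n=len(records), rng=rng)
print(noisy)
print(Counter(synthetic))
```

The noisy table is itself a releasable product (approach 2), and the synthetic records inherit its guarantee. A single-column histogram is the simplest possible "model"; richer generators such as a DP-GAN trade more complexity for the ability to capture relationships between columns.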