opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data
MIT License

Preserving the distribution - smartnoise-synth #564

Closed borisRa closed 1 year ago

borisRa commented 1 year ago

Hi,

I have a question regarding smartnoise-synth. Does the tool maintain the correlation between attributes when generating data, or does it only generate column-wise distributions?

Thanks, Boris

joshua-oss commented 1 year ago

The synthesizers all attempt to learn a joint distribution, with varying degrees of competence. For example, MWEM by default will measure 2-way combinations of columns, but can be configured to measure n-way combinations, including a fixed query workload (useful if there are specific columns whose correlations you want to make sure are preserved). MST can often perform better, because it first attempts to determine which columns are correlated, and then ensures that more privacy budget is spent on measuring those. AIM is similar, but can also take predefined workloads. The GAN-based synthesizers can handle much higher dimensionality by approximating the distribution, but are somewhat unpredictable at preserving n-way marginals. Because the performance of each synthesizer depends heavily on the data distribution, pre-processing, and hyperparameters, you will want to run some tests to choose the best approach.
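As a rough illustration of the kind of test meant here, the sketch below compares every 2-way marginal of a real table against a synthetic one using total variation distance. It is not part of smartnoise-synth: the helper names (`marginal`, `tv_distance`, `pairwise_tv`) and the toy rows are assumptions made up for this example, and in practice the synthetic rows would come from whichever synthesizer you are evaluating.

```python
from collections import Counter
from itertools import combinations


def marginal(rows, cols):
    """Empirical joint distribution of `cols` over a list of dict rows."""
    counts = Counter(tuple(row[c] for c in cols) for row in rows)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def pairwise_tv(real, synth, columns):
    """TV distance between real and synthetic data for every column pair."""
    return {
        pair: tv_distance(marginal(real, pair), marginal(synth, pair))
        for pair in combinations(columns, 2)
    }


# Toy data (hypothetical): columns "a" and "b" are perfectly correlated.
real = [{"a": 0, "b": 0}, {"a": 1, "b": 1}] * 50

# A synthetic table that reproduces the joint distribution.
synth_joint = [{"a": 0, "b": 0}, {"a": 1, "b": 1}] * 50

# A synthetic table that matches each column on its own but treats
# "a" and "b" as independent, so the correlation is lost.
synth_indep = [{"a": a, "b": b} for a in (0, 1) for b in (0, 1)] * 25

print(pairwise_tv(real, synth_joint, ["a", "b"]))
print(pairwise_tv(real, synth_indep, ["a", "b"]))
```

On this toy data both synthetic tables have identical 1-way marginals, but only the second shows a nonzero distance on the `("a", "b")` pair, which is the difference between preserving a joint distribution and preserving column-wise distributions only. Running the same comparison on the output of each candidate synthesizer, at the privacy budget you intend to use, gives a simple basis for choosing between them.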

borisRa commented 1 year ago

Thank you for the detailed answer!