opendp / smartnoise-sdk

Tools and service for differentially private processing of tabular and relational data

DPGAN + PATECTGAN: strange behaviour increasing epsilon budget #606

Open GiuliaGualtieri opened 1 day ago

GiuliaGualtieri commented 1 day ago

Issue Description

I ran a script performing a series of operations to test the accuracy and privacy of two of the available data synthesis methods: DPGAN vs. PATECTGAN (script: run_comparison.py, available in the attached MATERIAL-SMARTNOISE.zip).

Why is the RandomForest classifier still able to distinguish the private synthesized dataset from the original one when I increase the budget? I expected that, as epsilon converges to infinity, I would in effect obtain an "ideal" GAN that reproduces the original distribution of the PUMS dataset perfectly, so the classifier's accuracy would fall towards chance because it could no longer tell where the data came from. Why does this not happen? As you can see in the plot below, the accuracy rises to ~95%.

[Plot: Accuracy_DPGAN_PATEGAN_log(epsilon) — distinguisher accuracy for both synthesizers against log(epsilon)]

There is a value near epsilon = 5.0 where PATECTGAN drops to 62%; why does the accuracy get worse again after that point? I decided to write to you to shed some light on this behaviour, since I may be doing something wrong when training the neural networks or the RandomForest binary classifier.

Environment

Commands

You can find all the scripts for running and comparing the models in the attached MATERIAL-SMARTNOISE.zip.
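In case the attachment is not accessible, here is a minimal sketch of the distinguisher experiment the script performs. It assumes the snsynth Synthesizer.create factory API and an all-numeric PUMS sample; the file path, epsilon grid, and preprocessor_eps split are placeholders, not the exact contents of run_comparison.py:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from snsynth import Synthesizer

# Load the PUMS sample (path is a placeholder)
real = pd.read_csv("PUMS.csv")

def distinguisher_accuracy(real_df, synth_df):
    """Train a RandomForest to tell real rows (label 0) from synthetic rows
    (label 1). ~50% test accuracy means the two are indistinguishable;
    accuracy near 100% means they are trivially separable."""
    X = pd.concat([real_df, synth_df], ignore_index=True)
    y = [0] * len(real_df) + [1] * len(synth_df)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

for eps in [0.5, 1.0, 5.0, 10.0, 100.0]:
    for name in ["dpgan", "patectgan"]:
        synth = Synthesizer.create(name, epsilon=eps)
        # part of the budget is spent inferring column bounds/categories
        synth.fit(real, preprocessor_eps=0.5)
        fake = synth.sample(len(real))
        print(f"{name} eps={eps}: accuracy={distinguisher_accuracy(real, fake):.3f}")
```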

Results

You can find the synthetic private data in CSV format in the attached MATERIAL-SMARTNOISE.zip.

joshua-oss commented 1 day ago

There is a general issue of "mode collapse" with GAN-based synthesizers on tabular data, where infrequent combinations of attributes get suppressed in the output and the distribution is biased towards the most frequent categories. This happens even without differential privacy. The CT (conditional tabular) family of GANs attempts to fix this issue by oversampling rare categories, but the usual way of doing this unfortunately violates differential privacy. In cases where categories are fairly uniformly distributed, this might not be a major problem, but in general the GAN synthesizers will have a limited ability to model the data, even if no privacy is applied.
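One rough way to see this effect directly is to compare per-category frequencies between the real and synthetic marginals; a sketch (the DataFrames and column name are placeholders):

```python
import pandas as pd

def marginal_gap(real_df, synth_df, column):
    """Compare per-category frequencies of one column in the real vs. the
    synthetic data. Mode collapse shows up as rare categories shrinking
    towards zero while the most frequent categories are over-represented."""
    p = real_df[column].value_counts(normalize=True)
    q = synth_df[column].value_counts(normalize=True)
    idx = p.index.union(q.index)
    p, q = p.reindex(idx, fill_value=0.0), q.reindex(idx, fill_value=0.0)
    tvd = 0.5 * (p - q).abs().sum()  # total variation distance between marginals
    return pd.DataFrame({"real": p, "synthetic": q}), tvd

# e.g., for a categorical PUMS column (name is illustrative):
# table, tvd = marginal_gap(real, fake, "educ")
# print(table)
# print("TVD:", tvd)
```

If rare categories collapse towards zero frequency while the dominant ones grow, that is consistent with the mode-collapse behaviour described above, independent of the privacy budget.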