sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Differential Privacy parameter inclusion in generative model definition #779

Open Ishita-0112 opened 2 years ago

Ishita-0112 commented 2 years ago

Problem Description

Differential privacy preserves an individual's privacy by adding random noise while performing data analysis, so the output of any analysis becomes an approximation rather than an exact answer. Ɛ (epsilon), the privacy loss parameter, determines how much noise is introduced into the system: it bounds how much the result of a computation may change when any single record is removed from the dataset, so smaller values of Ɛ mean more noise and stronger privacy.

A differentially private synthetic dataset is generated from a statistical model fit on the original dataset. The synthetic dataset is a "fake" sample derived from the original data that retains as many of its statistical characteristics as possible. The essential advantage of the synthesizer approach is that, once generated, the differentially private dataset can be analyzed any number of times without increasing the privacy risk. While the synthetic dataset embodies the original data's essential properties, it is mathematically impossible to preserve full data utility and record-level privacy at the same time. Differential privacy therefore offers a principled way to protect sensitive information, such as customers' Personally Identifiable Information (PII).
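To make the role of Ɛ concrete, here is a minimal sketch of the classic Laplace mechanism (standard differential privacy, not SDV code): the noise scale is sensitivity / Ɛ, so a smaller Ɛ means more noise and stronger privacy.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon):
    # Laplace mechanism: add noise with scale = sensitivity / epsilon.
    # Smaller epsilon -> larger noise scale -> stronger privacy, lower accuracy.
    return true_value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query has sensitivity 1: removing one record changes the count by 1.
true_count = 1000
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.1))   # very noisy
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=10.0))  # close to 1000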

Expected behavior

Epsilon (Ɛ) could be added alongside the other hyperparameters that SDV already allows us to define for any generative model, for example:

s = generate(d){
    model: 'gc';
    epochs: 300;
    batch_size: 500;
    log_frequency: true;
    embedding_dim: 128;

    generator_lr: 2e-4;
    generator_decay: 1e-6;
    generator_dim: (256, 256);

    discriminator_lr: 2e-4;
    discriminator_decay: 1e-6;
    discriminator_steps: 1;
    discriminator_dim: (256, 256);

    verbose: false;
    diff_privacy: 0.01;
}
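For reference, a minimal sketch of how that proposal could map onto SDV's existing tabular API. The hyperparameters listed above belong to sdv.tabular.CTGAN rather than the Gaussian Copula model, so the sketch uses CTGAN; diff_privacy is the hypothetical new parameter this issue proposes, and real_data is a placeholder DataFrame.

from sdv.tabular import CTGAN

model = CTGAN(
    epochs=300,
    batch_size=500,
    log_frequency=True,
    embedding_dim=128,
    generator_lr=2e-4,
    generator_decay=1e-6,
    generator_dim=(256, 256),
    discriminator_lr=2e-4,
    discriminator_decay=1e-6,
    discriminator_steps=1,
    discriminator_dim=(256, 256),
    verbose=False,
    diff_privacy=0.01,  # hypothetical new parameter: the epsilon privacy budget
)
model.fit(real_data)                    # real_data: a pandas DataFrame
synthetic = model.sample(num_rows=500)  # safe to share and analyze repeatedly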

Additional context

The differential privacy aspect could be implemented in the generative model Python files; for example, to add differential privacy to the copula models, the logic could be defined there.
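As a rough illustration of what that logic could look like (hypothetical, not SDV's actual internals; a production implementation would need careful per-parameter sensitivity analysis and a privacy accountant), one option is to perturb the per-column statistics the model is fit from, rather than the raw records:

import numpy as np
import pandas as pd

def dp_column_means(data: pd.DataFrame, epsilon: float, bounds: dict) -> dict:
    # Illustrative only: Laplace-noised per-column means.
    # bounds[col] = (lo, hi) is the assumed value range, which fixes the
    # sensitivity of the mean at (hi - lo) / n; the total budget epsilon is
    # split evenly across columns (sequential composition).
    n = len(data)
    eps_per_col = epsilon / len(bounds)
    noisy = {}
    for col, (lo, hi) in bounds.items():
        clipped = data[col].clip(lo, hi)  # enforce the assumed bounds
        sensitivity = (hi - lo) / n
        noisy[col] = clipped.mean() + np.random.laplace(0.0, sensitivity / eps_per_col)
    return noisy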

npatki commented 2 years ago

Thanks for filing this feature request @Ishita-0112. We'll keep this open and use it to track progress and updates.

To help us prioritize, it would be great if you can describe your use case. How are you planning to use the synthetic data & who are you planning to share it with?

leslyarun commented 2 years ago

@npatki I think this is a great idea. Commercial tools like Gretel.ai allow you to train the model with or without differential privacy, so having DP as an option during training would be a very big feature.

Ishita-0112 commented 2 years ago


Thanks @npatki and @leslyarun. To address your question: our use case involves an MNC sharing data with a third party so that the third party can perform some analytics on it.

npatki commented 2 years ago

Thanks for the details! This will help us to plan and prioritize this issue.