vanderschaarlab / synthcity

A library for generating and evaluating synthetic tabular data for privacy, fairness and data augmentation.
https://www.vanderschaar-lab.com/
Apache License 2.0

Clarification regarding the `cond` arg role in CTGAN and TVAE training #151

Closed dionman closed 1 year ago

dionman commented 1 year ago

What's the purpose of passing a cond argument at the fit method for the CTGAN and TVAE models?

dionman commented 1 year ago

Would leaving this argument empty correspond to an ablation of the original models where conditioning by sampling is inactive?

robsdavis commented 1 year ago

Hi @dionman,

In synthcity, a cond can be passed in order to encourage a generative model to create records with specified values for a given field; that is, a "conditional" is applied to the synthetic data.

Passing a cond to fit prepares the synthetic model to generate records with the conditional applied. The cond passed here is usually something like the column from the real dataframe that you want to specify values for in the synthetic records.

Passing cond to generate applies that conditional to the generation of the synthetic records. Here cond should again be something like a numpy array or pandas series, this time containing the values you want to encourage the synthetic records to have, with one entry per synthetic record you ask generate to produce.

Leaving cond blank will cause the model to generate records based purely on the real data provided with no conditional applied. If you want to use a conditional, you should pass appropriate values to both fit and generate. If not, do not pass anything to either fit or generate.
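A minimal sketch of the shapes involved, using plain numpy/pandas with invented column names (no synthcity calls; it follows the pattern in tutorial0_basic_examples.ipynb, where the cond passed to generate has one entry per requested record):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe (column names invented for illustration).
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.integers(18, 90, size=200),
    "label": rng.integers(0, 2, size=200),
})

# cond for fit: usually a real column, one value per training row.
cond_fit = real["label"]

# cond for generate: the values you want the synthetic rows to take,
# one value per requested synthetic record.
count = 50
cond_gen = np.ones(count)  # encourage 50 records with label == 1

assert len(cond_fit) == len(real)
assert len(cond_gen) == count
```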

There are examples of cond being used in the tutorials, e.g. tutorial0_basic_examples.ipynb and tutorial3_survival_analysis.

Hope that helps :smile:

ZhaozhiQIAN commented 1 year ago

Hi @dionman, and more specifically about the CT-GAN model:

Leaving cond empty corresponds to the original CT-GAN, where training-by-sampling is enabled (i.e. a random column is selected and conditioned on during each training iteration).

When a cond argument is passed, the model becomes a conditional GAN, i.e. it models P(X | cond). In this case the same conditional column cond is used throughout training, which differs from the training-by-sampling algorithm.
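The difference can be illustrated with a small standalone sketch of the training-by-sampling step described in the CT-GAN paper (plain numpy, invented column names; synthcity's internal implementation is not shown here). Each iteration picks a discrete column at random and then a category weighted by log frequency, whereas a user-supplied cond fixes one column for the whole run:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete columns with category counts (names and counts invented).
columns = {
    "label":  {0: 150, 1: 50},
    "region": {"north": 90, "south": 110},
}

def sample_condition(rng, columns):
    """One training-by-sampling step (CT-GAN-style sketch): pick a discrete
    column uniformly, then a category with probability proportional to the
    log of its frequency, so rare categories are seen more often."""
    col = rng.choice(list(columns))
    cats = list(columns[col])
    freqs = np.array([columns[col][c] for c in cats], dtype=float)
    probs = np.log(freqs + 1)
    probs /= probs.sum()
    return col, cats[rng.choice(len(cats), p=probs)]

# Training-by-sampling: a different (column, value) pair each iteration.
for _ in range(3):
    print(sample_condition(rng, columns))

# A user-supplied cond instead conditions on the same column every iteration.
```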

dionman commented 1 year ago

Thanks for clarifying! Great work!

dionman commented 1 year ago

Hi @robsdavis @ZhaozhiQIAN. So, as far as I understand, training and sampling with cond set to the class column guarantees that the class proportions in the sampled synthetic population are exactly as specified (without needing to add related constraints), precisely because in the conditional CTGAN and TVAE the provided cond subvectors will always be part of the sampled output. Is this correct?

robsdavis commented 1 year ago

Using a cond does not guarantee that the synthetic data is strictly as specified. For example, if you use a conditional to request many records of a class that is poorly represented in the dataset, the model will try to create new records in that class but will fall short (it cannot extrapolate from nowhere), while still generating the requested count of records overall. Only using constraints can guarantee that classes are exactly as specified.

dionman commented 1 year ago

How do constraints handle the situation in the example you described?

robsdavis commented 1 year ago

Constraints are strict but simple. They are model-agnostic and filter the output of the generative model. Often, if you want a quality synthetic dataset but with a different distribution across a certain class, cond is what you want. If you want a strict rule to be obeyed (for example, if a certain value is not valid in your dataset), constraints are what you want.
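The filtering idea can be sketched with plain pandas (toy values; synthcity's actual constraints API is not shown here). The "synthetic" frame stands in for model output, and the boolean filter plays the role of a constraint: every surviving row obeys the rule, but fewer rows than requested may survive:

```python
import pandas as pd

# Toy "synthetic" output (values invented): a conditional encouraged
# label == 1, but the model could not guarantee it, and one row has
# an invalid negative age.
synthetic = pd.DataFrame({
    "age":   [25, 41, -3, 57, 33],
    "label": [1, 1, 1, 0, 1],
})

# A constraint acts as a strict, model-agnostic filter on the output.
# Here: ages must be non-negative and label must equal 1.
valid = synthetic[(synthetic["age"] >= 0) & (synthetic["label"] == 1)]

print(len(valid))  # fewer rows than generated, but every rule holds
```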