Would leaving this argument empty correspond to an ablation of the original models where conditioning by sampling is inactive?
Hi @dionman,
In Synthcity a `cond` can be passed in order to encourage generative models to create records with specified values for a given field, that is, a "conditional" is applied to the synthetic data.

Passing a `cond` to `fit` prepares the synthetic model to generate records with the conditional applied. The `cond` passed here is usually something like the column from the real dataframe that you want to specify values for in the synthetic records.

Passing `cond` to `generate` applies that conditional to the generation of the synthetic records. Here `cond` should be something like a numpy array/pandas series, which has the same length as the `cond` passed to `fit`, but contains the values you want to encourage the synthetic records to have.

Leaving `cond` blank will cause the model to generate records based purely on the real data provided, with no conditional applied. If you want to use a conditional, you should pass appropriate values to both `fit` and `generate`. If not, do not pass anything to either `fit` or `generate`.
There are examples of `cond` being used in the tutorials, e.g. tutorial0_basic_examples.ipynb and tutorial3_survival_analysis.
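For illustration, here is a minimal sketch of the two calls, assuming the standard Synthcity plugin API; the dataframe `X` and its `sex` column below are placeholders, not from the tutorials:

```python
import numpy as np
import pandas as pd

from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

# Placeholder "real" data; "sex" is the binary column we want to condition on.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 90, size=1000),
    "sex": rng.integers(0, 2, size=1000),
})
loader = GenericDataLoader(X)

model = Plugins().get("ctgan", n_iter=100)

# fit: pass the conditional column taken from the real data
model.fit(loader, cond=X["sex"])

# generate: pass the values the synthetic records should be encouraged to take
count = 10
synthetic = model.generate(count=count, cond=np.ones(count))  # encourage sex == 1
```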
Hope that helps :smile:
Hi @dionman, more specifically about the CT-GAN model.

Leaving `cond` empty corresponds to the original CT-GAN, where training-by-sampling is enabled (i.e. a random column is selected and conditioned on during each training iteration).

When passing a `cond` argument, the model becomes a conditional GAN, i.e. it models P(X | cond). In this case, the same conditional column `cond` is used throughout training, which differs from the training-by-sampling algorithm.
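As a rough illustrative sketch (not the Synthcity or CTGAN source) of what training-by-sampling does at each step, assuming a dataframe of discrete columns:

```python
import numpy as np
import pandas as pd

def sample_training_condition(df: pd.DataFrame, rng: np.random.Generator):
    """Pick a random discrete column, then one of its categories weighted by log-frequency."""
    col = rng.choice(df.columns)                 # a different column each training iteration
    counts = df[col].value_counts()
    logf = np.log(counts.to_numpy() + 1.0)       # log-frequency weighting
    value = rng.choice(counts.index.to_numpy(), p=logf / logf.sum())
    return col, value

rng = np.random.default_rng(0)
df = pd.DataFrame({"sex": ["m", "f", "f", "m", "f"], "smoker": ["y", "n", "n", "n", "n"]})
for _ in range(3):                               # one sampled conditional per iteration
    print(sample_training_condition(df, rng))
```

With a `cond` argument, by contrast, the conditioned column is fixed for every iteration rather than re-sampled like this.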
Thanks for clarifying! Great work!
Hi @robsdavis @ZhaozhiQIAN
So, as far as I understand, training and sampling with `cond` corresponding to the class column guarantees that the classes in the sampled synthetic data population are as specified (without the need to add related constraints) -- precisely because in the conditional CTGAN and TVAE the provided `cond` subvectors will always be part of the sampled output. Is this correct?
Using a `cond` does not guarantee that the synthetic data is strictly as specified. For example, if you try to use a conditional to create many records of a class that is not well represented in the dataset, the model will try to create new records in this class, but it may fall short for that class (it cannot extrapolate from nothing) while still generating `count` records overall. Only using constraints can guarantee that classes are exactly as specified.
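A hedged sketch of how to check this, assuming the standard Synthcity plugin API; the dataframe `X` and its imbalanced `label` column are hypothetical:

```python
import numpy as np
import pandas as pd

from synthcity.plugins import Plugins
from synthcity.plugins.core.dataloader import GenericDataLoader

rng = np.random.default_rng(0)
# "label" is heavily imbalanced: roughly 5% of the real records belong to class 1.
X = pd.DataFrame({
    "feature": rng.normal(size=2000),
    "label": (rng.random(2000) < 0.05).astype(int),
})
loader = GenericDataLoader(X)

model = Plugins().get("ctgan", n_iter=100)
model.fit(loader, cond=X["label"])

count = 1000
# Ask for the rare class only; the model is encouraged, not forced, to comply.
synthetic = model.generate(count=count, cond=np.ones(count))

# The realised class mix can fall short of the request for an under-represented class.
print(synthetic.dataframe()["label"].value_counts(normalize=True))
```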
How do constraints handle the situation in the example you described?
Constraints are strict, but simple. They are model agnostic and will filter the output of the generative model. Often, if you want to create a quality synthetic dataset but with a different distribution across a certain class, `cond` is what you want. If you want a strict rule to be obeyed, for example if a certain value is not valid in your dataset, `constraints` are what you want.
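A minimal sketch of strict filtering with constraints, assuming Synthcity's `Constraints` helper and its rule-tuple syntax (the columns and rules below are hypothetical):

```python
import numpy as np
import pandas as pd

from synthcity.plugins import Plugins
from synthcity.plugins.core.constraints import Constraints
from synthcity.plugins.core.dataloader import GenericDataLoader

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(10, 90, size=1000),
    "label": rng.integers(0, 2, size=1000),
})
loader = GenericDataLoader(X)

model = Plugins().get("ctgan", n_iter=100)
model.fit(loader)

# Strict, model-agnostic rules: rows violating any rule are filtered out of the output,
# so generation may need several attempts to collect `count` valid records.
rules = Constraints(rules=[("age", ">=", 18), ("label", "==", 1)])
synthetic = model.generate(count=100, constraints=rules)
```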
What's the purpose of passing a `cond` argument to the `fit` method for the CTGAN and TVAE models?