sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

HMASynthesizer sometimes creates null values (out-of-bounds parameters synthesized) #1691

Closed npatki closed 7 months ago

npatki commented 10 months ago

Environment Details

Error Description

The SDV is only supposed to synthesize null values if the real data also has null values. However, in some cases, the HMA Synthesizer creates erroneous null values in columns that are not supposed to have them. These incorrect nulls only appear in child tables (i.e. tables with a parent). Root tables are unaffected and do not contain any of these nulls.
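To check whether a given table is affected, one can compare which columns contain nulls in the real data and in the synthetic data. The sketch below is only an illustration in plain Python: the helper names and the dict-of-lists table layout are assumptions made for this example, not part of the SDV API. With pandas DataFrames, the same comparison could be done with `isna().any()`.

```python
import math


def is_null(value):
    """Treat None and float NaN as null."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def find_erroneous_null_columns(real_table, synthetic_table):
    """Return columns that have nulls in the synthetic table but not in the real one.

    Both tables are hypothetical dicts mapping column name -> list of values.
    """
    flagged = []
    for column, synthetic_values in synthetic_table.items():
        real_has_nulls = any(is_null(v) for v in real_table.get(column, []))
        synthetic_has_nulls = any(is_null(v) for v in synthetic_values)
        if synthetic_has_nulls and not real_has_nulls:
            flagged.append(column)
    return flagged


# Made-up child table: 'browser' legitimately has nulls, 'duration' does not.
real_sessions = {'duration': [12.0, 30.5, 8.2], 'browser': ['a', None, 'b']}
synthetic_sessions = {'duration': [11.1, float('nan'), 9.0], 'browser': ['a', 'b', None]}

print(find_erroneous_null_columns(real_sessions, synthetic_sessions))  # ['duration']
```

Only `duration` is flagged, because nulls in `browser` also exist in the real data and are therefore expected.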

Root Cause

The HMA algorithm works by summarizing the distribution of the children of each parent row. For example, when using a Beta distribution, it summarizes the children using the parameters alpha, beta, loc and scale. It then models these parameters and creates new ones from scratch during sampling.

Unfortunately, the new parameters that are sampled are not guaranteed to be in-bounds. So there is a chance that the sampled alpha or beta parameter will be <=0. This is invalid for a Beta distribution, which is only defined when alpha and beta are >0.
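The effect can be illustrated without SDV. The sketch below is a deliberate simplification: it assumes the fitted alpha values are modeled with a plain normal distribution (HMA's actual model is more involved), and the alpha values themselves are made up. Even though every observed alpha is valid, a noticeable share of the newly sampled ones is not.

```python
import random
import statistics

# Hypothetical alpha parameters, one per parent row; all are valid (> 0).
observed_alphas = [0.4, 0.9, 1.5, 0.2, 2.8]

# Simplifying assumption: model the parameters with a normal distribution.
mean = statistics.mean(observed_alphas)
stdev = statistics.stdev(observed_alphas)

# Sample new parameters "from scratch", as happens during sampling.
rng = random.Random(0)
sampled_alphas = [rng.gauss(mean, stdev) for _ in range(10_000)]

# Any sampled alpha <= 0 is out-of-bounds for a Beta distribution.
invalid = [a for a in sampled_alphas if a <= 0]
print(f"{len(invalid)} of {len(sampled_alphas)} sampled alphas are <= 0")
```

Nothing in the sampling step knows about the lower bound, which is why out-of-bounds parameters can appear.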

Fixes

HMA should apply a FloatFormatter to each of the extended columns (for marginal distributions as well as the covariance columns). The FloatFormatter should be set up to clip the synthesized min/max values.

FloatFormatter(enforce_min_max_values=True)
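As a rough illustration of what clipping to the fitted min/max means here, the sketch below uses a small stand-in class. `MinMaxClipper` is hypothetical and written only for this example. It is not the RDT `FloatFormatter`, whose internals may differ.

```python
class MinMaxClipper:
    """Hypothetical stand-in for a transformer that remembers fitted bounds."""

    def fit(self, values):
        # Remember the range observed in the fitted parameters.
        self.min_value = min(values)
        self.max_value = max(values)
        return self

    def clip(self, values):
        # Pull any out-of-range value back to the nearest observed bound.
        return [min(max(v, self.min_value), self.max_value) for v in values]


# Made-up alpha values seen during fitting; all are > 0.
fitted_alphas = [0.4, 0.9, 1.5, 0.2, 2.8]
clipper = MinMaxClipper().fit(fitted_alphas)

# A negative sampled alpha is clipped to the smallest fitted value.
print(clipper.clip([-0.7, 1.1, 3.6]))  # [0.2, 1.1, 2.8]
```

Because every fitted parameter was valid, clipping to the fitted range keeps the sampled parameters valid as well.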

Note that these transformers should be accessible after fitting in an easy-to-understand way. For HSA, we are using the parameter extended_columns. We should do the same here.

>>> synthesizer.extended_columns['my_table_name']
{
   '<extended_column_name>': FloatFormatter(enforce_min_max_values=True),
   '<extended_column_name>': FloatFormatter(enforce_min_max_values=True),
   ...
}

A more robust option for HMA would be to apply some kind of transformer to the extended columns (for each parameter: alpha, beta, etc.). This transformer could be responsible for clipping the min/max values in case they are synthesized to be out-of-bounds.

There is an issue for this in Copulas (see issue 367). However, Copulas is not really expected to work with invalid parameter values.

npatki commented 8 months ago

Workarounds

Option 1: Users encountering this issue may have better luck with using the 'truncnorm' (or 'norm') distribution rather than the default 'beta' distribution. This is not a guaranteed fix, but it makes it much less likely for the synthesizer to run into this issue.

Use the code below to adjust the distribution.

from sdv.multi_table import HMASynthesizer

# TODO replace with your table names
TABLE_NAMES = ['users', 'sessions', 'transactions']  # ... list all of your tables

# `metadata` is the multi-table metadata object describing your dataset
synthesizer = HMASynthesizer(metadata)

# Apply the same settings to every table
for table_name in TABLE_NAMES:
    synthesizer.set_table_parameters(
        table_name=table_name,
        table_parameters={
            'enforce_min_max_values': True,
            'default_distribution': 'truncnorm'
        }
    )

Option 2: Use the HSASynthesizer instead. It uses a different algorithm, so it does not have this bug.