sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Numerical Instability in Constrained GaussianCopula #806

Open tlranda opened 2 years ago

tlranda commented 2 years ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

When using integer values in a constrained GaussianCopula, numerical instability can cause valid inputs to fail to produce any rows with the sample_conditions() call. This doesn't happen for all inputs, but the bug can be reliably produced for some value(s) in the range of valid inputs for virtually any Between constraint instance. I do not believe the problem is strictly isolated to Between, but I have not attempted to reproduce the error for other constraint classes.

What happens

Given a particular Between constraint bounded by low and high, conditional sampling of some values such that low < x < high fails to produce any values, but other valid inputs work as expected. The following exception traceback will be generated, erroneously claiming that the input value violated the constraint:

Traceback (most recent call last):
  File "SDV/sdv/tabular/copulas.py", line 285, in sample_conditions
    conditions, 100, batch_size, randomize_samples, output_file_path)
  File "SDV/sdv/tabular/base.py", line 715, in _sample_conditions
    handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
  File "SDV/sdv/tabular/utils.py", line 175, in handle_sampling_error
    raise sampling_error
  File "SDV/sdv/tabular/base.py", line 711, in _sample_conditions
    batch_size_per_try,
  File "SDV/sdv/tabular/utils.py", line 224, in check_num_rows
    raise ValueError(user_msg)
ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model.
Please try again with a different set of values.

What should happen

The function call is expected to work without issue for all valid integer inputs when constraints are specified as integer types.

Steps to reproduce

import sdv 
from sdv.tabular import GaussianCopula
from sdv.constraints import Between
from sdv.sampling.tabular import Condition
import pandas as pd
import numpy as np

# The high and low values alter which exact integers are unstable
# Another script can be used to search for unstable conversions for arbitrary low/high values
constraint_input = Between(column='input', low=49, high=100)
model = GaussianCopula(
            field_names=['input', 'output'],
            field_transformers = {'input': 'integer', # Problematic conversions may occur
                                  'output': 'float',},
            constraints=[constraint_input],
            min_value = None,
            max_value = None)
# The particular data (and amount used) do not matter, but should be present for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': [_ for _ in range(i,j)],
                                    'output': [np.random.rand() for _ in range(j-i)]})
model.fit(arbitrary_data)
# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])

Running the above as mwe.py:

$ python mwe.py 
Sampling conditions:   0%|         | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "mwe.py", line 25, in <module>
    output = model.sample_conditions([conditions])
  File "SDV/sdv/tabular/copulas.py", line 285, in sample_conditions
    conditions, 100, batch_size, randomize_samples, output_file_path)
  File "SDV/sdv/tabular/base.py", line 715, in _sample_conditions
    handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
  File "SDV/sdv/tabular/utils.py", line 175, in handle_sampling_error
    raise sampling_error
  File "SDV/sdv/tabular/base.py", line 711, in _sample_conditions
    batch_size_per_try,
  File "SDV/sdv/tabular/utils.py", line 224, in check_num_rows
    raise ValueError(user_msg)
ValueError: Unable to sample any rows for the given conditions. This may be because the provided values are out-of-bounds in the current model. 
Please try again with a different set of values.

In the particular example above, debugging shows that during the reverse transformation the internal floating point value is 87.9999999 rather than 88.0, so casting to an integer truncates it to 87 and makes every sampled row fail the constraint check.
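
As a minimal illustration of the casting behaviour described above (plain Python/NumPy, not SDV internals): a float that has drifted just below the intended integer truncates when cast, while rounding first recovers it.

import numpy as np

reconstructed = 87.9999999                         # condition value that drifted below 88
print(int(reconstructed))                          # 87: int() truncates toward zero
print(np.array([reconstructed]).astype(np.int64))  # [87]: astype truncates the same way
print(int(round(reconstructed)))                   # 88: rounding before the cast recovers the value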

Known Workarounds

I believe the issue should be fixed in SDV itself, but in the meantime it is possible to work around it by avoiding SDV's integer conversions. Ideally, however, numerical stability on integer values would be reliable as expected.
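
For reference, one way to apply that workaround with the 0.x API is to declare the column as a float and do the integer cast yourself after sampling. The snippet below (reusing constraint_input, arbitrary_data and Condition from the MWE above) is only a sketch of that idea, not an officially supported pattern.

model = GaussianCopula(
            field_names=['input', 'output'],
            field_transformers={'input': 'float',   # avoid SDV's internal integer conversion
                                'output': 'float'},
            constraints=[constraint_input],
            min_value=None,
            max_value=None)
model.fit(arbitrary_data)
conditions = Condition({'input': 88}, num_rows=3)
sampled = model.sample_conditions([conditions])
# Recover integer values explicitly, rounding before the cast.
sampled['input'] = sampled['input'].round().astype(int)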

tlranda commented 2 years ago

I can also replicate this using floating point values when strict=False on the constraint if the lower bound is strictly equal to the constrained value. Adding np.finfo(float).eps will push a low value back within the constraint bounds for SDV's validation check; this may also be at play in the integer case above, but it cannot be naively applied to all constrained inputs without adverse effects on upper bound values.

As far as I've been able to determine, SDV always accepts/rejects constrained input that is equal to the upper constraint value as expected according to the truthiness of strict.

jaehoonkoo commented 2 years ago

Is there any reason for the scaling data = data * 0.95 + 0.025 and the logit data = np.log(data / (1.0 - data)) in the transform and reverse transform for Between constraints? https://github.com/sdv-dev/SDV/blob/v0.14.1/sdv/constraints/tabular.py#L876-L877
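
For context, the round trip those two lines imply looks roughly like the sketch below. The min-max scaling into [0, 1] using low and high is my reading of the surrounding code, so treat the exact steps as approximate; the point is simply that a logit followed by its inverse is not guaranteed to be bit-exact, which is where values like 87.999... can come from.

import numpy as np

low, high, x = 49, 100, 88

# Forward transform (approximate reading of the linked v0.14.1 code):
scaled = (x - low) / (high - low)              # min-max scale into [0, 1]
scaled = scaled * 0.95 + 0.025                 # shrink away from the 0/1 boundaries
transformed = np.log(scaled / (1.0 - scaled))  # logit

# Reverse transform:
recovered = 1.0 / (1.0 + np.exp(-transformed))  # sigmoid
recovered = (recovered - 0.025) / 0.95
recovered = recovered * (high - low) + low

print(recovered)       # may come back as 87.999999999999... rather than exactly 88.0
print(int(recovered))  # truncates to 87 whenever the round trip lost any precision downward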

amontanez24 commented 2 years ago

@tlranda It seems that during the reverse_transform for the Between constraint, the condition value (i.e. 88) is being reconstructed and is sometimes a little off. In this case we get 87.999999999 instead of 88. Then, when the columns are converted back to their original dtypes, the value is cast as an int and goes down to 87.

This is definitely a bug that we can fix for the next release. Another workaround for now would be to use reject_sampling as the handling_strategy for the constraint.

jaehoonkoo commented 2 years ago

@amontanez24, I made a workaround in sdv/metadata/table.py applying round() before astype() for integer type. https://github.com/jaehoonkoo/SDV/blob/v0.14.1-fix_btw/sdv/metadata/table.py#L716-L721
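
For readers who do not want to follow the link, the change is essentially the one sketched below; the helper name and signature are illustrative, not the exact code in table.py.

import numpy as np
import pandas as pd

def cast_back(column: pd.Series, dtype) -> pd.Series:
    """Cast a reverse-transformed column to its original dtype, rounding
    first for integer dtypes so float error cannot truncate values."""
    if np.issubdtype(np.dtype(dtype), np.integer):
        column = column.round()
    return column.astype(dtype)

print(cast_back(pd.Series([87.9999999]), 'int64'))  # 88 rather than 87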

amontanez24 commented 2 years ago

@amontanez24, I made a workaround in sdv/metadata/table.py applying round() before astype() for integer type. https://github.com/jaehoonkoo/SDV/blob/v0.14.1-fix_btw/sdv/metadata/table.py#L716-L721

Yeah this is pretty much the solution I was considering.

tlranda commented 2 years ago

Another workaround for now would be to use reject_sampling as the handling_strategy for the constraint.

From what I've seen in testing, reject_sampling will fail to generate any rows as they all suffer from the bug. I tried @jaehoonkoo's change and that does appear to work as a fix for SDV itself.

amontanez24 commented 2 years ago

I was able to generate rows using reject_sampling as the handling strategy. It shouldn't fail from that bug since the condition value never gets altered. It might fail on occasion with the same error message though, but that would be because it had a hard time generating the desired number of rows. I was able to do it with the following code snippet

constraint_input = Between(column='input', low=49, high=100, handling_strategy='reject_sampling')
model = GaussianCopula(
            field_names=['input', 'output'],
            field_transformers={'input': 'integer',  # Problematic conversions may occur
                                'output': 'float'},
            constraints=[constraint_input],
            min_value=None,
            max_value=None)
# The particular data (and amount used) do not matter, but should be present for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': [_ for _ in range(i,j)],
                                    'output': [np.random.rand() for _ in range(j-i)]})
model.fit(arbitrary_data)

# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])

tlranda commented 1 year ago

This issue has been re-introduced with the ScalarRange constraint, and perhaps other constraint types as well. I have replicated it in versions 1.1.0 and 1.2.0, but did not track down the exact commit in which it was re-introduced.

Simple MWE for the bug's return in SDV 1.2.0 commit 750332ca2dfc58517e88680d37d1e1cbb9e5b819

import sdv
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition
from sdv.metadata import SingleTableMetadata
import pandas as pd, numpy as np

arbitrary_data = pd.DataFrame({'x': [1,3,6], 'y': [3.,6.,9.]})
meta = SingleTableMetadata()
meta.detect_from_dataframe(arbitrary_data)

lo = 0
hi = 100
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
       'column_name': 'x',
       'low_value': lo,
       'high_value': hi,
       'strict_boundaries': False
    }
}

model = GaussianCopulaSynthesizer(meta, enforce_min_max_values=False)
model.add_constraints(constraints=[my_constraint])
model.fit(arbitrary_data)

n_sampled = {}
good_keys = []
bad_keys = []
tests = np.arange(int(lo), int(hi+1))
# These values will be numerically unstable for the given arbitrary data/constraints -- changing the setup will change which indices are unstable
# tests = [5, 7, 9, 12, 15, 19, 21, 22, 29, 32, 36, 49, 57, 58, 65, 83, 89, 96, 100]
for _ in tests:
    condition = [Condition(num_rows=10, column_values={'x': _})]
    try:
        n_sampled[_] = len(model.sample_from_conditions(condition))
    except:
        n_sampled[_] = 0
    if n_sampled[_] == 0:
        bad_keys.append(_)
    else:
        good_keys.append(_)
if len(bad_keys) > len(good_keys):
    print("Bad outcome. Good keys?", good_keys)
else:
    print("Good outcome. Bad keys?", bad_keys)

The fix remains the same as before: integer datatypes MUST be rounded before casting, so that numerical instability cannot truncate condition values and cause every conditionally sampled row to be rejected.

Example fix:

# Line 1186
else:
+   if self._dtype in [int, np.int64, np.int32]:
+       data = data.round(0)
    table_data[self._column_name] = data.astype(self._dtype)

Permalink to portion of affected code for above: https://github.com/sdv-dev/SDV/blob/750332ca2dfc58517e88680d37d1e1cbb9e5b819/sdv/constraints/tabular.py#L1186

amontanez24 commented 1 year ago

@tlranda Thank you for bringing this up. I was able to replicate as well. Reopening for now