tlranda opened 2 years ago
I can also replicate this using floating point values when `strict=False` on the constraint, if the lower bound is exactly equal to the constrained value. Adding `np.finfo(float).eps` pushes a low value back within the constraint bounds for SDV's validation check. This may also be at play in the integer case above, but it cannot be naively applied to all constrained inputs without adverse effects on upper-bound values.

As far as I've been able to determine, SDV always accepts/rejects constrained input that is equal to the upper constraint value as expected, according to the truthiness of `strict`.
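A minimal sketch of that epsilon nudge (illustrative only, not SDV's validation code; `low`, `high`, and `value` are hypothetical names):

```python
import numpy as np

low, high = 0.0, 1.0
value = 0.0  # sampled value sitting exactly on the lower bound

# Nudge a boundary value just inside the bounds so a strict check passes.
if value <= low:
    value = low + np.finfo(float).eps

# np.nextafter(low, high) is a more robust nudge for arbitrary magnitudes,
# since eps can be smaller than the float spacing around large bound values.
```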
Is there any reason for the `data = data * 0.95 + 0.025` and the logit `data = np.log(data / (1.0 - data))` in the transform and reverse transform for `Between` constraints?
https://github.com/sdv-dev/SDV/blob/v0.14.1/sdv/constraints/tabular.py#L876-L877
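For context, a quick round trip of that transform (a sketch, not SDV's code) suggests the squeeze exists to keep the logit finite at the boundaries, and also shows how reversing it reintroduces floating point error:

```python
import numpy as np

data = np.array([0.0, 0.5, 1.0])             # data normalized to [0, 1]
squeezed = data * 0.95 + 0.025               # squeezed into [0.025, 0.975]
logit = np.log(squeezed / (1.0 - squeezed))  # finite even at 0 and 1

# Reverse transform: sigmoid, then undo the squeeze.
sigmoid = 1.0 / (1.0 + np.exp(-logit))
recovered = (sigmoid - 0.025) / 0.95
print(recovered)  # can differ from `data` in the last few ulps
```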
@tlranda It seems that during the `reverse_transform` for the `Between` constraint, the condition value (i.e. 88) is being reconstructed and sometimes it is a little off. In this case we get 87.999999999 instead of 88. Then, when the columns are converted back to their original dtypes and the value is cast as an int, it goes down to 87.

This is definitely a bug that we can fix for the next release. Another workaround for now would be to use `reject_sampling` as the `handling_strategy` for the constraint.
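The truncation is easy to see in isolation; integer casts in Python/NumPy truncate toward zero rather than round:

```python
import numpy as np

x = 87.999999999
print(int(x))         # 87 -- the cast truncates
print(np.int64(x))    # 87
print(int(round(x)))  # 88 -- rounding first recovers the condition value
```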
@amontanez24, I made a workaround in `sdv/metadata/table.py`, applying `round()` before `astype()` for integer types. https://github.com/jaehoonkoo/SDV/blob/v0.14.1-fix_btw/sdv/metadata/table.py#L716-L721
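A sketch of that approach, assuming a helper that casts a column back to its original dtype (`cast_to_dtype` is a hypothetical name, not the actual function in `table.py`):

```python
import numpy as np
import pandas as pd

def cast_to_dtype(data: pd.Series, dtype) -> pd.Series:
    """Cast a column back to its original dtype, rounding integer targets first."""
    if np.issubdtype(np.dtype(dtype), np.integer):
        data = data.round()  # 87.999999999 -> 88.0 before the cast
    return data.astype(dtype)

print(cast_to_dtype(pd.Series([87.999999999]), np.int64))  # 88
```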
Yeah this is pretty much the solution I was considering.
> Another work around for now would be to use `reject_sampling` as the `handling_strategy` for the constraint.

From what I've seen in testing, `reject_sampling` will fail to generate any rows, as they all suffer from the bug. I tried @jaehoonkoo's change and that does appear to work as a fix for SDV itself.
I was able to generate rows using `reject_sampling` as the handling strategy. It shouldn't fail from that bug, since the condition value never gets altered. It might fail on occasion with the same error message, but that would be because it had a hard time generating the desired number of rows. I was able to do it with the following code snippet:
```python
import numpy as np
import pandas as pd
from sdv.constraints import Between
from sdv.sampling import Condition
from sdv.tabular import GaussianCopula

constraint_input = Between(column='input', low=49, high=100,
                           handling_strategy='reject_sampling')
model = GaussianCopula(
    field_names=['input', 'output'],
    field_transformers={'input': 'integer',  # Problematic conversions may occur
                        'output': 'float'},
    constraints=[constraint_input],
    min_value=None,
    max_value=None)

# The particular data (and amount used) do not matter, but should be present
# for the model to have a sampling basis
i, j = 50, 80
arbitrary_data = pd.DataFrame(data={'input': list(range(i, j)),
                                    'output': [np.random.rand() for _ in range(j - i)]})
model.fit(arbitrary_data)

# In this case of low=49, high=100; input=88 is the only unstable value
conditions = Condition({'input': 88}, num_rows=3)
output = model.sample_conditions([conditions])
```
This issue has been re-introduced with `ScalarRange`, and perhaps other constraint types as well. I have replicated it in versions 1.1.0 and 1.2.0, but did not track the exact commit in which it was re-introduced.

Simple MWE for the bug's return in SDV 1.2.0, commit 750332ca2dfc58517e88680d37d1e1cbb9e5b819:
```python
import numpy as np
import pandas as pd
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.sampling import Condition
from sdv.metadata import SingleTableMetadata

arbitrary_data = pd.DataFrame({'x': [1, 3, 6], 'y': [3., 6., 9.]})
meta = SingleTableMetadata()
meta.detect_from_dataframe(arbitrary_data)

lo = 0
hi = 100
my_constraint = {
    'constraint_class': 'ScalarRange',
    'constraint_parameters': {
        'column_name': 'x',
        'low_value': lo,
        'high_value': hi,
        'strict_boundaries': False
    }
}
model = GaussianCopulaSynthesizer(meta, enforce_min_max_values=False)
model.add_constraints(constraints=[my_constraint])
model.fit(arbitrary_data)

n_sampled = {}
good_keys = []
bad_keys = []
tests = np.arange(int(lo), int(hi + 1))
# These values will be numerically unstable for the given arbitrary data and
# constraints -- changing the setup will change which indices are unstable
# tests = [5, 7, 9, 12, 15, 19, 21, 22, 29, 32, 36, 49, 57, 58, 65, 83, 89, 96, 100]
for value in tests:
    condition = [Condition(num_rows=10, column_values={'x': value})]
    try:
        n_sampled[value] = len(model.sample_from_conditions(condition))
    except Exception:
        n_sampled[value] = 0
    if n_sampled[value] == 0:
        bad_keys.append(value)
    else:
        good_keys.append(value)
if len(bad_keys) > len(good_keys):
    print("Bad outcome. Good keys?", good_keys)
else:
    print("Good outcome. Bad keys?", bad_keys)
```
The fix remains the same as before: integer datatypes MUST be rounded before casting, to prevent numerical instability from truncating condition values and thereby rejecting all conditionally sampled data.
Example fix:
```diff
# Line 1186
  else:
+     if self._dtype in [int, np.int64, np.int32]:
+         data = data.round(0)
      table_data[self._column_name] = data.astype(self._dtype)
```
Permalink to the affected code referenced above: https://github.com/sdv-dev/SDV/blob/750332ca2dfc58517e88680d37d1e1cbb9e5b819/sdv/constraints/tabular.py#L1186
@tlranda Thank you for bringing this up. I was able to replicate it as well. Reopening for now.
Environment Details
- SDV version: 0.14.1 (the release referenced in the permalinks above)
Error Description
When using integer values in a constrained GaussianCopula, numerical instability can cause valid inputs to fail to produce any rows from the `sample_conditions()` call. This doesn't happen for all inputs, but the bug can be reliably produced for some value(s) in the range of valid inputs for virtually any `Between` constraint instance. I do not believe the problem is strictly isolated to `Between`, but I have not attempted to reproduce the error for other constraint classes.

What happens

Given a particular `Between` constraint bounded by `low` and `high`, conditional sampling of some values such that `low < x < high` fails to produce any rows, while other valid inputs work as expected. The resulting exception traceback erroneously claims that the input value violated the constraint.

What should happen
The function call is expected to work without issue for all valid integer inputs when constraints are specified as integer types.
Steps to reproduce
Running the above as `mwe.py` and debugging shows that during the reverse transformation, the internal floating point value is 87.9999999 rather than 88.0, so casting to an integer demotes it to 87 and makes all sampled rows fail the constraint.
Known Workarounds
I believe the issue should be fixed for SDV's benefit, but in the meantime it is possible to get around it by avoiding SDV's integer conversions. Ideally, though, numerical stability would be reliable on integer values as expected.

- Using the same values in floating point format for each of the constraint, the sampling input, and the training data avoids the integer conversion entirely; the residual numerical error is smaller than `rtol`'s default value, so validation passes.
- Instead of integers, use the floating point bounds `low=0`, `high=1` and apply the appropriate data conversions (i.e. `x = x*(high-low)+low` and `x = (x-low)/(high-low)`, where `high` and `low` are the desired integer constraint bounds) before modeling/requesting samples and after sampling; a sketch of this rescaling follows below.
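A minimal sketch of that rescaling workaround, assuming the model is fit on the rescaled column (`to_unit` and `from_unit` are hypothetical helper names, not SDV APIs):

```python
import numpy as np
import pandas as pd

LOW, HIGH = 49, 100  # desired integer constraint bounds

def to_unit(x):
    """Map integer-range data into [0, 1] floats before modeling."""
    return (x - LOW) / (HIGH - LOW)

def from_unit(x):
    """Map sampled [0, 1] floats back to the original integer range."""
    return x * (HIGH - LOW) + LOW

train = pd.DataFrame({'input': np.arange(50, 80, dtype=float)})
train['input'] = to_unit(train['input'])
# Fit the model on `train`, then condition on to_unit(88) instead of 88.
# After sampling, restore integers explicitly:
# samples['input'] = from_unit(samples['input']).round().astype(int)
```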