sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Cannot apply `Inequality` constraint on `datetime` columns with missing values #1121

Closed npatki closed 1 year ago

npatki commented 1 year ago

Environment Details

Error Description

If two datetime columns are involved in an Inequality constraint and those columns contain missing values, the software crashes before I can get any synthetic data. Where and how it crashes depends on the method I use to supply the constraint.

Expected Behavior: Datetime columns in an Inequality constraint should behave the same as numerical columns. That is, the inequality comparison should only happen when both values are non-missing. Otherwise, we can skip the comparison and just proceed with modeling.
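A minimal sketch of the expected semantics, written with plain pandas rather than SDV's internals. The helper name `is_valid_inequality` is hypothetical, and the non-strict `<=` comparison is an assumption. Rows where either value is missing are treated as valid.

```python
import numpy as np
import pandas as pd


def is_valid_inequality(data, low_column_name, high_column_name):
    """Hypothetical helper: True where low <= high, or where either value is missing."""
    low = data[low_column_name]
    high = data[high_column_name]
    # Comparisons against NaT/NaN evaluate to False, so missing rows are
    # explicitly marked as valid instead of being compared.
    either_missing = low.isna() | high.isna()
    return either_missing | (low <= high)


example = pd.DataFrame({
    'datetime_low': pd.to_datetime(
        pd.Series(['2020-01-01', '2020-03-01', np.nan, '2023-04-14', np.nan])),
    'datetime_high': pd.to_datetime(
        pd.Series(['2022-03-04', np.nan, '2022-09-12', '2022-03-01', np.nan])),
})

# Only the fourth row has two present values that violate low <= high.
valid = is_valid_inequality(example, 'datetime_low', 'datetime_high')
```

The same function works unchanged on numerical columns, which is the parity the expected behavior asks for.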

Steps to reproduce

METHOD 1: Using an Inequality constraint object. We can do this when the columns are represented as datetime dtypes.

import pandas as pd
import numpy as np
from sdv.constraints import Inequality
from sdv.tabular import GaussianCopula

data = pd.DataFrame(data={
    'numerical': [0, 1, 2, 3, 4],
    'datetime_low': ['2020-01-01', '2020-03-01', np.nan, '2020-04-14', np.nan],
    'datetime_high': ['2022-03-04', np.nan, '2022-09-12', '2022-03-01', np.nan]
})

# convert to datetime so it gets recognized as a datetime object
data['datetime_low'] = pd.to_datetime(data['datetime_low'])
data['datetime_high'] = pd.to_datetime(data['datetime_high'])

constraint = Inequality(
    low_column_name='datetime_low',
    high_column_name='datetime_high'
)

model = GaussianCopula(constraints=[constraint])
model.fit(data)
model.sample(10)

Output:

/usr/local/lib/python3.7/dist-packages/pandas/core/dtypes/cast.py in astype_float_to_int_nansafe(values, dtype, copy)
   1212     if not np.isfinite(values).all():
   1213         raise IntCastingNaNError(
-> 1214             "Cannot convert non-finite values (NA or inf) to integer"
   1215         )
   1216     return values.astype(dtype, copy=copy)

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer
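The traceback suggests that a float array containing NaN is being cast to an integer dtype somewhere in the datetime handling. The sketch below reproduces only that pandas behavior in isolation. It is not SDV code, and where exactly SDV performs the cast is not shown in the trace above.

```python
import numpy as np
import pandas as pd

# A float column with a missing value, standing in for datetimes that were
# converted to numbers (an assumption about what happens internally).
floats = pd.Series([1.0, 2.0, np.nan])

try:
    floats.astype('int64')
    cast_failed = False
except ValueError:
    # pandas' IntCastingNaNError is a subclass of ValueError.
    cast_failed = True

# The nullable integer dtype can hold missing values, so this cast succeeds.
nullable = floats.astype('Int64')
```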

METHOD 2: Inputting the constraint into the metadata itself. This is required when the datetimes are represented as strings.

import pandas as pd
import numpy as np
from sdv.tabular import GaussianCopula

data = pd.DataFrame(data={
    'numerical': [0, 1, 2, 3, 4],
    'datetime_low': ['2020-01-01', '2020-03-01', np.nan, '2020-04-14', np.nan],
    'datetime_high': ['2022-03-04', np.nan, '2022-09-12', '2022-03-01', np.nan]
})

metadata = {
    'fields': {
        'numerical': { 'type': 'numerical', 'subtype': 'integer' },
        'datetime_low': { 'type': 'datetime', 'format': '%Y-%m-%d' },
        'datetime_high': { 'type': 'datetime', 'format': '%Y-%m-%d' }
    },
    'constraints': [{
        'constraint': 'sdv.constraints.tabular.Inequality',
        'high_column_name': 'datetime_high',
        'low_column_name': 'datetime_low'
    }]
}

model = GaussianCopula(table_metadata=metadata)
model.fit(data)

Output:

/usr/local/lib/python3.7/dist-packages/sdv/metadata/table.py in _fit_constraints(self, data)
    440 
    441         if errors:
--> 442             raise MultipleConstraintsErrors('\n' + '\n\n'.join(map(str, errors)))
    443 
    444     def _transform_constraints(self, data, is_condition=False):

MultipleConstraintsErrors: 
ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
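
The `isnan` message is what NumPy raises when `np.isnan` receives an object-dtype array, which is one way pandas stores datetimes given as strings. Whether this is the exact call inside the constraint is an assumption. The sketch below only shows the NumPy behavior and the dtype-agnostic `pd.isna` alternative.

```python
import numpy as np
import pandas as pd

# Datetimes kept as strings, forced to object dtype for the illustration.
values = pd.Series(['2020-01-01', np.nan, '2020-04-14'], dtype=object)

try:
    np.isnan(values.to_numpy())
    isnan_failed = False
except TypeError:
    # np.isnan has no implementation for object arrays.
    isnan_failed = True

# pd.isna handles object, float and datetime data alike.
missing = pd.isna(values)
```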

npatki commented 1 year ago

Workaround

Although this is not ideal, the workaround for now is to drop any rows where either the low or the high column is missing.

# replace with the names of your datetime columns
training_data = real_data.dropna(subset=['datetime_low', 'datetime_high'])

model.fit(training_data)
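
Applied to the example data from the steps to reproduce, the workaround looks like the sketch below. Only pandas is used here, and the variable names are illustrative. It keeps just 2 of the 5 rows, which is why it is not ideal.

```python
import numpy as np
import pandas as pd

real_data = pd.DataFrame(data={
    'numerical': [0, 1, 2, 3, 4],
    'datetime_low': ['2020-01-01', '2020-03-01', np.nan, '2020-04-14', np.nan],
    'datetime_high': ['2022-03-04', np.nan, '2022-09-12', '2022-03-01', np.nan]
})

# Keep only rows where both constraint columns are present.
training_data = real_data.dropna(subset=['datetime_low', 'datetime_high'])

rows_before = len(real_data)
rows_after = len(training_data)
```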

This issue should be fixed in an upcoming release of the SDV.

npatki commented 1 year ago

This issue has now been resolved in the new SDV 1.0 library.