sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Sampling an HMA1 model with Inequality constraints on date fields gives different outputs #1221

Closed mounirHai closed 1 year ago

mounirHai commented 1 year ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

When using the sample method of an HMA1 model with Inequality constraints on date fields, I get two different results depending on the sample size:

Steps to reproduce

see https://github.com/sdv-dev/SDV/issues/1191

Attached are the outputs of the two sample calls:
[sampel5000.log](https://github.com/sdv-dev/SDV/files/10557556/sampel5000.log)
[sample_500.log](https://github.com/sdv-dev/SDV/files/10557559/sample_500.log)
npatki commented 1 year ago

Hi @mounirHai, thanks for filing the new issue. I'm not seeing the log attachments to this issue. Would you mind adding them?

Some next steps to help us replicate and investigate:

  1. Would you be able to share the full metadata for all the tables that are involved in the HMA1 model?
  2. Can you confirm whether you encounter this issue every time you try to sample, or is this issue intermittent? (There is some randomness to the sampling logic.)
  3. Can you confirm whether you are encountering this issue if you remove the Inequality constraint? I'm trying to determine if this issue is specifically due to the constraint or if it's something else.
mounirHai commented 1 year ago

Hi again!

  1. Below is the metadata:

    {
      "tables": {
        "patient": {
          "fields": {
            "patientNr": {"type": "id", "subtype": "integer"},
            "sex": {"type": "categorical"},
            "ageGrp": {"type": "categorical"}
          },
          "primary_key": "patientNr"
        },
        "stay": {
          "fields": {
            "stayID": {"type": "id", "subtype": "integer"},
            "referalID": {"ref": {"field": "referalID", "table": "referal"}, "type": "id", "subtype": "integer"},
            "stayDate": {"type": "datetime", "format": "%Y-%m-%d"},
            "stayLength": {"type": "numerical", "subtype": "integer"},
            "spesialist": {"type": "categorical"},
            "rolleGrp": {"type": "categorical"},
            "polUtforendeGrp": {"type": "categorical"},
            "kontaktTypeGrp": {"type": "categorical"},
            "MainDiagGrp": {"type": "categorical"}
          },
          "primary_key": "stayID"
        },
        "referal": {
          "fields": {
            "patientNr": {"ref": {"field": "patientNr", "table": "patient"}, "type": "id", "subtype": "integer"},
            "referalID": {"type": "id", "subtype": "integer"},
            "referalInnDate": {"type": "datetime", "format": "%Y-%m-%d"},
            "referalOutDate": {"type": "datetime", "format": "%Y-%m-%d"}
          },
          "constraints": [
            {
              "constraint": "sdv.constraints.tabular.Inequality",
              "low_column_name": "referalInnDate",
              "high_column_name": "referalOutDate"
            }
          ],
          "primary_key": "referalID"
        }
      }
    }

  2. There is indeed some randomness, but sampling a high number of rows does seem to throw the error with the Inequality constraint on date variables.

  3. Sampling without the Inequality constraint on dates goes fine.
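As a quick sanity check, the constraint portion of the metadata above parses as plain JSON. A minimal sketch (abridged to the referal table; the field and constraint names are taken verbatim from the metadata, everything else is omitted for brevity):

```python
import json

# Abridged copy of the "referal" table from the metadata above.
metadata_snippet = """
{
  "tables": {
    "referal": {
      "fields": {
        "referalInnDate": {"type": "datetime", "format": "%Y-%m-%d"},
        "referalOutDate": {"type": "datetime", "format": "%Y-%m-%d"}
      },
      "constraints": [
        {
          "constraint": "sdv.constraints.tabular.Inequality",
          "low_column_name": "referalInnDate",
          "high_column_name": "referalOutDate"
        }
      ]
    }
  }
}
"""
metadata = json.loads(metadata_snippet)
constraint = metadata["tables"]["referal"]["constraints"][0]
print(constraint["low_column_name"], "<", constraint["high_column_name"])
```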

npatki commented 1 year ago

Thanks @mounirHai, I can confirm that the metadata looks ok.

Given the randomness, it seems like this may be a tough one to try to replicate. At the moment, I'm not so sure what may be the root cause of this.

Were you able to locate the error log? It would be helpful if you could attach it here. Alternatively, you could copy/paste the stack trace (i.e. everything that is printed out when you get the error).

mounirHai commented 1 year ago

below is the log trace when sampling with Inequality constraint on dates:

log.txt

mounirHai commented 1 year ago

When sampling 500 rows with the Inequality constraint on dates (no error thrown), I did a print of the Inequality constraint's reverse_transform method: print_output.txt

npatki commented 1 year ago

Thank you, this is very helpful! To save some space, I converted your log and print outputs into text files and attached them.

Seems like the offending line is in the Inequality constraint's reverse_transform:

table_data[self._high_column_name] = pd.Series(diff_column + low).astype(self._dtype)

It appears the addition can cause an overflow error if the difference value is too high. For example, the code below produces the same error:

import pandas as pd

fake_diff = pd.Series([pd.Timedelta(days=100000)])
fake_low_column = pd.Series([pd.to_datetime('2016-12-08T00:00:00.000000000')])

print(fake_diff + fake_low_column)  # overflows: the result exceeds pandas' maximum timestamp
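One hypothetical way to guard against this (illustrative only, not SDV code): clip the sampled differences to the headroom remaining before pandas' maximum representable timestamp, so the addition can never overflow.

```python
import pandas as pd

# Same setup as the failing example above.
diff = pd.Series([pd.Timedelta(days=100000)])
low = pd.Series([pd.to_datetime('2016-12-08T00:00:00.000000000')])

# Hypothetical guard: per-row headroom before pd.Timestamp.max, used as an
# upper clip so that diff + low stays representable.
max_allowed = pd.Timestamp.max - low
safe_diff = diff.clip(upper=max_allowed)

high = safe_diff + low
print(high.iloc[0])  # capped at pd.Timestamp.max instead of overflowing
```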

I'm curious what the two columns in your Inequality constraint represent. Are there large differences between them?

Quick Fix: The HMA may be generating large diff values with low probability. It may help to fit the model using a bounded distribution such as truncated gaussian.

model = HMA1(
    metadata=metadata,
    model_kwargs={ 'default_distribution': 'truncated_gaussian' }
)
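To illustrate why a bounded distribution helps here: a truncated gaussian can, by construction, never sample outside its bounds, so the extreme date differences coming from an unbounded gaussian's tails disappear. A sketch with scipy.stats.truncnorm and made-up parameters (this is not SDV's internal fitting code):

```python
import numpy as np
from scipy.stats import truncnorm

# Made-up parameters: suppose referral durations of 0-365 days with a fitted
# mean of 30 days and a wide standard deviation of 60 days.
low_days, high_days = 0.0, 365.0
mean, std = 30.0, 60.0

# truncnorm takes its bounds standardised against loc/scale.
a = (low_days - mean) / std
b = (high_days - mean) / std
samples = truncnorm.rvs(a, b, loc=mean, scale=std, size=10_000, random_state=0)

# Every sample respects the bounds, unlike an unbounded gaussian's tails.
print(samples.min() >= low_days, samples.max() <= high_days)  # True True
```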
mounirHai commented 1 year ago

Hi again! The referral duration can span quite a long period, so yes, there is a large difference between the InnDate and the OutDate. There can also be some noise in the data, with some extremely large OutDates; I did not check. I'll try the truncated Gaussian and come back to you with the results.

mounirHai commented 1 year ago

Hi again! I tried truncated Gaussian with the Inequality constraint on dates. When running model.fit(), I get the following warning:

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\copulas\univariate\truncated_gaussian.py:46: RuntimeWarning: divide by zero encountered in double_scalars

When running model.sample(), I get the following error:

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

The error stack is attached: error.log
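For reference, the IntCastingNaNError is easy to reproduce in plain pandas; a minimal sketch of the suspected proximate cause (an assumption, not SDV code):

```python
import numpy as np
import pandas as pd

# A float column containing a non-finite value cannot be cast to an integer
# dtype; this produces the same error message seen when sampling.
col = pd.Series([1.0, np.nan])
try:
    col.astype('int64')
    raised = False
except ValueError as err:  # IntCastingNaNError subclasses ValueError
    raised = True
    print(err)  # Cannot convert non-finite values (NA or inf) to integer
```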

npatki commented 1 year ago

Hi @mounirHai, this is a known bug, #1121, that you can refer to.

mounirHai commented 1 year ago

Hi again! It cannot be NA in the date columns; I have checked for NA in the date fields and there are none. This is the line from the log. Notice that it is the reverse_transform method throwing the error on the diff column generated by the model.

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\constraints\tabular.py in _reverse_transform(self, table_data)
    436
    437 if self._is_datetime:
--> 438     diff_column = diff_column.astype('timedelta64[ns]')
    439
    440
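A sketch of the suspected mechanism (an assumption, not SDV code): even when the original date columns contain no NA, a NaN generated by the model in the synthetic diff column survives the cast at line 438 as NaT, and then breaks the downstream integer cast.

```python
import numpy as np
import pandas as pd

# A model-generated diff column (float nanoseconds) with one NaN value.
diff_column = pd.Series([86_400e9, np.nan])

# The cast from tabular.py line 438: the NaN silently becomes NaT.
as_timedelta = diff_column.astype('timedelta64[ns]')
print(as_timedelta.isna().tolist())
```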

mounirHai commented 1 year ago

And when I remove the Inequality constraint from the metadata and fit the model with a truncated_gaussian distribution, I get the following error when sampling:

ValueError                                Traceback (most recent call last)
~\AppData\Local\Temp\3\ipykernel_636\4263793738.py in <module>
----> 1 synthetic_data = model.sample(num_rows=1)

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\base.py in sample(self, table_name, num_rows, sample_children, reset_primary_keys)
    183 self._reset_primary_keys_generators()
    184
--> 185 return self._sample(table_name, num_rows, sample_children)
    186
    187 def save(self, path):

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\hma.py in _sample(self, table_name, num_rows, sample_children)
    593 for table in self.metadata.get_tables():
    594 if not self.metadata.get_parents(table):
--> 595 self._sample_table(table, num_rows, sampled_data=sampled_data)
    596
    597 return self._finalize(sampled_data)

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\hma.py in _sample_table(self, table_name, num_rows, sample_children, sampled_data)
    552
    553 if sample_children:
--> 554 self._sample_children(table_name, sampled_data, table_rows)
    555
    556 return sampled_data

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\hma.py in _sample_children(self, table_name, sampled_data, table_rows)
    427
    428 child_rows = sampled_data[child_name]
--> 429 self._sample_children(child_name, sampled_data, child_rows)
    430
    431 @staticmethod

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\hma.py in _sample_children(self, table_name, sampled_data, table_rows)
    424 LOGGER.info('Sampling rows from child table %s', child_name)
    425 for _, row in table_rows.iterrows():
--> 426 self._sample_child_rows(child_name, table_name, row, sampled_data)
    427
    428 child_rows = sampled_data[child_name]

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\relational\hma.py in _sample_child_rows(self, table_name, parent_name, parent_row, sampled_data)
    393 table_meta = self._models[table_name].get_metadata()
    394 model = self._model(table_metadata=table_meta)
--> 395 model.set_parameters(parameters)
    396
    397 table_rows = self._sample_rows(model, table_name)

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\tabular\base.py in set_parameters(self, parameters)
    862
    863 if self._metadata.get_dtypes(ids=False):
--> 864 self._set_parameters(parameters)
    865
    866 def save(self, path):

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\tabular\copulas.py in _set_parameters(self, parameters)
    433 """
    434 parameters = unflatten_dict(parameters)
--> 435 parameters = self._rebuild_gaussian_copula(parameters)
    436
    437 self._model = copulas.multivariate.GaussianMultivariate.from_dict(parameters)

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\tabular\copulas.py in _rebuild_gaussian_copula(self, model_parameters)
    419 covariance = model_parameters.get('covariance')
    420 if covariance:
--> 421 model_parameters['covariance'] = self._rebuild_correlation_matrix(covariance)
    422 else:
    423 model_parameters['covariance'] = [[1.0]]

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\tabular\copulas.py in _rebuild_correlation_matrix(cls, triangular_covariance)
    391 correlation += np.identity(size)
    392
--> 393 return cls._get_nearest_correlation_matrix(correlation).tolist()
    394
    395 def _rebuild_gaussian_copula(self, model_parameters):

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\sdv\tabular\copulas.py in _get_nearest_correlation_matrix(matrix)
    329 Insipired by: https://stackoverflow.com/a/63131250
    330 """
--> 331 eigenvalues, eigenvectors = scipy.linalg.eigh(matrix)
    332 negative = eigenvalues < 0
    333 identity = np.identity(len(matrix))

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\scipy\linalg\decomp.py in eigh(a, b, lower, eigvals_only, overwrite_a, overwrite_b, turbo, eigvals, type, check_finite, subset_by_index, subset_by_value, driver)
    443 ''.format(driver, '", "'.join(drv_str[1:])))
    444
--> 445 a1 = _asarray_validated(a, check_finite=check_finite)
    446 if len(a1.shape) != 2 or a1.shape[0] != a1.shape[1]:
    447 raise ValueError('expected square "a" matrix')

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\scipy\_lib\_util.py in _asarray_validated(a, check_finite, sparse_ok, objects_ok, mask_ok, as_inexact)
    291 raise ValueError('masked arrays are not supported')
    292 toarray = np.asarray_chkfinite if check_finite else np.asarray
--> 293 a = toarray(a)
    294 if not objects_ok:
    295 if a.dtype is np.dtype('O'):

C:\ProgramData\Anaconda3\envs\dsc_hvikt_syndata\lib\site-packages\numpy\lib\function_base.py in asarray_chkfinite(a, dtype, order)
    487 if a.dtype.char in typecodes['AllFloat'] and not np.isfinite(a).all():
    488 raise ValueError(
--> 489 "array must not contain infs or NaNs")
    490 return a
    491

ValueError: array must not contain infs or NaNs
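The bottom frame of this traceback is straightforward to reproduce in isolation, which suggests a NaN or inf is reaching the rebuilt correlation matrix (illustrative only):

```python
import numpy as np
import scipy.linalg

# eigh validates its input by default (check_finite=True); any NaN or inf in
# the matrix raises the ValueError seen at the bottom of the traceback.
matrix = np.array([[1.0, np.nan], [np.nan, 1.0]])
try:
    scipy.linalg.eigh(matrix)
    raised = False
except ValueError as err:
    raised = True
    print(err)  # array must not contain infs or NaNs
```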

npatki commented 1 year ago

Hi @mounirHai, I realize it's been a few months since our conversation. Were you able to get this to work? From re-reading this thread, I'm not sure which is the most relevant problem (there seem to be many different attempts).

Please note that we have recently released an SDV 1.0 version with improved APIs, workflows and bug fixes. I wonder if upgrading to the latest version of this library will fix some of these errors?

npatki commented 1 year ago

Closing this issue off as stale. Please feel free to reply if you are continuing to see problems and we can reopen the investigation.