sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

PARSynthesizer errors during `fit` if sequence_index is numerical sdtype #2079

Closed: frances-h closed this issue 1 week ago

frances-h commented 2 weeks ago

Error Description

With #2043, we fixed an issue where enforce_min_max_values was being set to True by default for the sequence_index transformer. However, if no transformer is assigned to the sequence_index (i.e. if the sequence_index is already a numerical sdtype), fit now errors.

To fix this, we should check (1) that a transformer has been assigned (i.e. it is not None) and (2) that the transformer has the enforce_min_max_values attribute. Instead of adding a separate attribute check, we could use getattr with a False default value in place of accessing the attribute directly. A sketch of the guarded check is shown below.
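
Roughly, the guard in auto_assign_transformers could look something like this (a minimal sketch of the change described above, not the final implementation):

# sketch only: guard against a missing transformer for the sequence_index
if self._sequence_index:
    sequence_index_transformer = self.get_transformers()[self._sequence_index]
    # the transformer is None when the sequence_index is already a numerical sdtype
    if sequence_index_transformer and getattr(
        sequence_index_transformer, 'enforce_min_max_values', False
    ):
        sequence_index_transformer.enforce_min_max_values = False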

Steps to reproduce

from sdv.datasets.demo import download_demo
from sdv.sequential import PARSynthesizer

data, metadata = download_demo('sequential', 'CMAPPSJetEngine')
s1 = PARSynthesizer(metadata)
s1.fit(data)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[4], line 1
----> 1 s1.fit(data1)
      2 s1.sample(10)

File ~/Documents/SDV/sdv/single_table/base.py:471, in BaseSynthesizer.fit(self, data)
    469 self._data_processor.reset_sampling()
    470 self._random_state_set = False
--> 471 processed_data = self.preprocess(data)
    472 self.fit_processed_data(processed_data)

File ~/Documents/SDV/sdv/single_table/base.py:407, in BaseSynthesizer.preprocess(self, data)
    400     warnings.warn(
    401         'This model has already been fitted. To use the new preprocessed data, '
    402         "please refit the model using 'fit' or 'fit_processed_data'."
    403     )
    405 is_converted = self._store_and_convert_original_cols(data)
--> 407 preprocess_data = self._preprocess(data)
    409 if is_converted:
    410     data.columns = self._original_columns

File ~/Documents/SDV/sdv/sequential/par.py:286, in PARSynthesizer._preprocess(self, data)
    284 sequence_key_transformers = {sequence_key: None for sequence_key in self._sequence_key}
    285 if not self._data_processor._prepared_for_fitting:
--> 286     self.auto_assign_transformers(data)
    288 self.update_transformers(sequence_key_transformers)
    289 preprocessed = super()._preprocess(data)

File ~/Documents/SDV/sdv/sequential/par.py:267, in PARSynthesizer.auto_assign_transformers(self, data)
    265 if self._sequence_index:
    266     sequence_index_transformer = self.get_transformers()[self._sequence_index]
--> 267     if sequence_index_transformer.enforce_min_max_values:
    268         sequence_index_transformer.enforce_min_max_values = False

AttributeError: 'NoneType' object has no attribute 'enforce_min_max_values'
ryantimjohn commented 6 days ago

Is there any workaround end-users can apply in the meantime, until the release with the fix drops, @lajohn4747?

npatki commented 6 days ago

Hi @ryantimjohn, sure thing. The bug only appears when the sequence_index is a numerical sdtype; it works just fine if the sdtype is datetime. So one workaround is to convert your numerical column into datetimes. In the example below, I convert a numerical column to datetimes by adding the number of days to Jan 1, 2000:

import pandas as pd
from sdv.sequential import PARSynthesizer

index_name = 'COLUMN_NAME' # replace with the name of your numerical sequence index column

# convert the sequence index to datetime and update metadata to match
data[index_name] = pd.to_datetime('2000-01-01') + pd.to_timedelta(data[index_name], unit='d')
metadata.update_column(
    column_name=index_name,
    sdtype='datetime'
)

# now you can model and sample synthetic data using PAR
synthesizer = PARSynthesizer(metadata)
synthesizer.fit(data)
synthetic_data = synthesizer.sample(num_sequences=10)

# be sure to convert the datetimes back into numbers
# (dividing by a 1-day Timedelta recovers the day counts; this is exact for integer days)
synthetic_data[index_name] = (synthetic_data[index_name] - pd.to_datetime('2000-01-01')) / pd.Timedelta(days=1)
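
If you want to double-check the round trip before fitting, a quick sanity check like this (a sketch that assumes the column holds day counts, and that you grab a copy before running the conversion above) should reproduce the original values:

# optional sanity check: converting to datetimes and back should reproduce the original values
original_values = data[index_name].copy()  # take this copy before the conversion above
as_datetimes = pd.to_datetime('2000-01-01') + pd.to_timedelta(original_values, unit='d')
recovered = (as_datetimes - pd.to_datetime('2000-01-01')) / pd.Timedelta(days=1)
assert (recovered == original_values).all()  # exact for integer day counts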

This is a bit hacky, but after the next release, you will not need to apply the workaround. Hope that helps!

ryantimjohn commented 5 days ago

Came up with the same solution, thank you!

ryantimjohn commented 5 days ago

@npatki Unfortunately, when I did this I ran into another error. Sorry to ask for troubleshooting help, but I saw you dealt with a similar error in https://github.com/sdv-dev/SDV/issues/1214, so I wondered if you could take a look.

After converting the sequence index column to a datetime and updating the metadata, running the dataframe through the PARSynthesizer gives this error: AttributeError: 'NoneType' object has no attribute 'is_generator'

Does any reason why this might happen come to mind?

Thanks very much for your help!

Full stack trace:

Cell In[15], line 5
      1 from sdv.sequential import PARSynthesizer
      2 synthesizer = PARSynthesizer(
      3     modified_metadata,context_columns=context_columns,enforce_min_max_values=False,
      4         verbose=True)
----> 5 synthesizer.fit(modified_data)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:460, in BaseSynthesizer.fit(self, data)
    458 self._data_processor.reset_sampling()
    459 self._random_state_set = False
--> 460 processed_data = self.preprocess(data)
    461 self.fit_processed_data(processed_data)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:396, in BaseSynthesizer.preprocess(self, data)
    389     warnings.warn(
    390         'This model has already been fitted. To use the new preprocessed data, '
    391         "please refit the model using 'fit' or 'fit_processed_data'."
    392     )
    394 is_converted = self._store_and_convert_original_cols(data)
--> 396 preprocess_data = self._preprocess(data)
    398 if is_converted:
    399     data.columns = self._original_columns

File /opt/conda/lib/python3.10/site-packages/sdv/sequential/par.py:280, in PARSynthesizer._preprocess(self, data)
    277 if not self._data_processor._prepared_for_fitting:
    278     self.auto_assign_transformers(data)
--> 280 self.update_transformers(sequence_key_transformers)
    281 preprocessed = super()._preprocess(data)
    283 if self._sequence_index:

File /opt/conda/lib/python3.10/site-packages/sdv/sequential/par.py:303, in PARSynthesizer.update_transformers(self, column_name_to_transformer)
    299 if set(column_name_to_transformer).intersection(set(self.context_columns)):
    300     raise SynthesizerInputError(
    301         'Transformers for context columns are not allowed to be updated.')
--> 303 super().update_transformers(column_name_to_transformer)

File /opt/conda/lib/python3.10/site-packages/sdv/single_table/base.py:228, in BaseSynthesizer.update_transformers(self, column_name_to_transformer)
    226 self._validate_transformers(column_name_to_transformer)
    227 self._warn_for_update_transformers(column_name_to_transformer)
--> 228 self._data_processor.update_transformers(column_name_to_transformer)
    229 if self._fitted:
    230     msg = 'For this change to take effect, please refit the synthesizer using `fit`.'

File /opt/conda/lib/python3.10/site-packages/sdv/data_processing/data_processor.py:652, in DataProcessor.update_transformers(self, column_name_to_transformer)
    646     raise NotFittedError(
    647         'The DataProcessor must be prepared for fitting before the transformers can be '
    648         'updated.'
    649     )
    651 for column, transformer in column_name_to_transformer.items():
--> 652     if column in self._keys and not transformer.is_generator():
    653         raise SynthesizerInputError(
    654             f"Invalid transformer '{transformer.__class__.__name__}' for a primary "
    655             f"or alternate key '{column}'. Please use a generator transformer instead."
    656         )
    658 with warnings.catch_warnings():
npatki commented 5 days ago

Hi @ryantimjohn, no problem. I suspect this is unrelated to the fit error and has something to do with the data/metadata itself. Would you mind filing a new bug with this info? It would be helpful if you could also share the (updated) metadata and perhaps an example of the data itself that has the datetime index column. Thanks!
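
In the meantime, one way to narrow it down might be to print out which transformer ends up assigned to each column before fitting. A debugging sketch, reusing the variable names from your traceback:

from sdv.sequential import PARSynthesizer

# inspect the auto-assigned transformers; look for columns mapped to None or to unexpected transformers
debug_synth = PARSynthesizer(modified_metadata, context_columns=context_columns)
debug_synth.auto_assign_transformers(modified_data)
print(debug_synth.get_transformers())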