sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

LSTMDetection throws a ValueError on date column #422

Open mohammedsabiya opened 11 months ago

mohammedsabiya commented 11 months ago

Environment Details

Error Description

I get an error when I try to use the Detection: Sequential metrics to evaluate the real data against the synthetic data. The error is as follows:

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py:40: FutureWarning:

In a future version of pandas, a length 1 tuple will be returned when iterating over a groupby with a grouper equal to a list of length 1. Don't supply a list with a single grouper to avoid this warning.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-53-f6c26dfeeb5a> in <cell line: 3>()
      1 from sdmetrics.timeseries import LSTMDetection
      2 
----> 3 LSTMDetection.compute(
      4     real_data=training_data_ref,
      5     synthetic_data=synthetic_data,

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py in compute(cls, real_data, synthetic_data, metadata, sequence_key)
     82 
     83         real_x = cls._build_x(real_data, ht, sequence_key)
---> 84         synt_x = cls._build_x(synthetic_data, ht, sequence_key)
     85 
     86         X = pd.concat([real_x, synt_x])

/usr/local/lib/python3.10/dist-packages/sdmetrics/timeseries/detection.py in _build_x(data, hypertransformer, sequence_key)
     40         for entity_id, entity_data in data.groupby(sequence_key):
     41             entity_data = entity_data.drop(sequence_key, axis=1)
---> 42             entity_data = hypertransformer.transform(entity_data)
     43             entity_data = pd.Series({
     44                 column: entity_data[column].to_numpy()

/usr/local/lib/python3.10/dist-packages/sdmetrics/utils.py in transform(self, data)
    198                 # Categorical column.
    199                 col_data = pd.DataFrame({'field': data[field]})
--> 200                 out = transform_info['one_hot_encoder'].transform(col_data).toarray()
    201                 transformed = pd.DataFrame(
    202                     out, columns=[f'value{i}' for i in range(np.shape(out)[1])])

/usr/local/lib/python3.10/dist-packages/sklearn/utils/_set_output.py in wrapped(self, X, *args, **kwargs)
    138     @wraps(f)
    139     def wrapped(self, X, *args, **kwargs):
--> 140         data_to_wrap = f(self, X, *args, **kwargs)
    141         if isinstance(data_to_wrap, tuple):
    142             # only wrap the first output for cross decomposition

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py in transform(self, X)
    915             "infrequent_if_exist",
    916         }
--> 917         X_int, X_mask = self._transform(
    918             X,
    919             handle_unknown=self.handle_unknown,

/usr/local/lib/python3.10/dist-packages/sklearn/preprocessing/_encoders.py in _transform(self, X, handle_unknown, force_all_finite, warn_on_unknown)
    172                         " during transform".format(diff, i)
    173                     )
--> 174                     raise ValueError(msg)
    175                 else:
    176                     if warn_on_unknown:

ValueError: Found unknown categories ['2022-01-30 20:56:25', '2022-03-13 22:36:04', '2022-03-08 05:56:53', '2022-02-18 19:55:38', '2022-02-06 14:56:50', '2022-01-20 06:05:25', '2022-02-20 05:22:48', '2022-02-10 21:01:33', '2022-02-13 19:49:27', '2022-02-18 16:44:19'] in column 0 during transform

I am getting the same error when using LSTMClassifierEfficacy as well.

The unknown categories mentioned in the error come from the date column, whose dtype is object in both the real and the synthetic data.
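To illustrate the failure mode (a minimal sketch, not taken from the original report or from SDMetrics itself): because the date column has object dtype, the metric's HyperTransformer one-hot encodes it as a categorical, so any synthetic timestamp that never appears in the real data is an unknown category at transform time. The frame and column names below are illustrative.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Real data holds only these two timestamps, stored as strings (object dtype).
real = pd.DataFrame({'date': ['2022-01-01 00:00:00', '2022-01-02 00:00:00']})
# Synthetic data contains a timestamp the encoder never saw during fit.
synthetic = pd.DataFrame({'date': ['2022-01-30 20:56:25']})

encoder = OneHotEncoder()  # default handle_unknown='error'; the raised ValueError implies the same setting here
encoder.fit(real)
encoder.transform(synthetic)  # ValueError: Found unknown categories [...] in column 0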

Code Implementation

Here is the code:

from sdmetrics.timeseries import LSTMDetection

LSTMDetection.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    sequence_key=['id']
)

from sdmetrics.timeseries import LSTMClassifierEfficacy

LSTMClassifierEfficacy.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    target='combined_label'
)

Thank you :)

iamamiramine commented 2 months ago

Were you able to solve the issue?

mohammedsabiya commented 2 months ago

yes

iamamiramine commented 2 months ago

how?

Ng-ms commented 1 month ago

@mohammedsabiya Can you please share with us how you solved it?

srinify commented 1 month ago

@Ng-ms @mohammedsabiya @iamamiramine

I was able to reproduce the issue and I opened a new ticket here for the team to look at: https://github.com/sdv-dev/SDMetrics/issues/584

I will close this issue out for now and mark it as a duplicate of https://github.com/sdv-dev/SDMetrics/issues/584 -- we can focus our discussion in the new issue.

Can you folks try the following workaround to see if that resolves the issue?

Suggested Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is to use pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])
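For completeness, here is a minimal sketch of the workaround applied end to end, reusing the variable names from the original report (training_data_ref, synthetic_data, metadata, sequence_key=['id']); date_col_1 is a placeholder for whatever your datetime column is called. The cast has to happen on both the real and the synthetic frame, otherwise the encoder is still fit on object-typed strings:

import pandas as pd
from sdmetrics.timeseries import LSTMDetection

# Cast the datetime column in BOTH frames so the metric no longer
# treats it as a categorical (object) column.
for frame in (training_data_ref, synthetic_data):
    frame['date_col_1'] = pd.to_datetime(frame['date_col_1'])

score = LSTMDetection.compute(
    real_data=training_data_ref,
    synthetic_data=synthetic_data,
    metadata=metadata,
    sequence_key=['id'],
)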