sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Metric 'Detection: Sequential' gives ValueError for datetime column #580

Closed Scit3ch closed 2 months ago

Scit3ch commented 3 months ago

Environment Details

Error Description

Using the Detection: Sequential metric, I get the following error: ValueError: Found unknown categories ['2023-11-12 12:52:00', '2024-01-15 02:00:30' ... The traceback points to lines 84 and 42 of ...\sdmetrics\timeseries\detection.py and line 213 of ...\sklearn\preprocessing\_encoders.py

The metadata for this column is: "date": { "sdtype": "datetime", "datetime_format": "%Y-%m-%d %H:%M:%S" }

Steps to reproduce

As the original data is confidential, I can't share it here. However, the following steps lead to the described error, so any dataset with a datetime column may be sufficient to reproduce it:

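from sdv.sequential import PARSynthesizer
from sdmetrics.timeseries import LSTMDetection

# metadata and real_data come from the confidential dataset (not shown)
synthesizer = PARSynthesizer(
    metadata,
    enforce_min_max_values=True,
    enforce_rounding=False,
    epochs=100,
    context_columns=['ContextColumn1'],
    verbose=True
)

synthesizer.fit(real_data)

synthetic_data = synthesizer.sample(
    num_sequences=20,
    sequence_length=None
)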

LSTMDetection.compute(
    real_data=real_data,
    synthetic_data=synthetic_data,
    metadata=metadata,
    sequence_key='user_id'
)

Temporary solution

The issue is caused by the default behavior of scikit-learn's OneHotEncoder, which raises an error when the data being transformed contains values that were not seen when the encoder was fitted. Changing line 161 of \Lib\site-packages\sdmetrics\utils.py from enc = OneHotEncoder() to enc = OneHotEncoder(handle_unknown='ignore') resolves the error.
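The encoder behavior described above can be reproduced outside SDMetrics. The following is a minimal sketch, assuming datetimes stored as strings are treated as categories; the sample values and variable names are made up for illustration and are not taken from the SDMetrics source.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Datetimes kept as strings behave like categories (illustrative values).
real = np.array([["2023-01-01 00:00:00"], ["2023-01-02 00:00:00"]], dtype=object)
synthetic = np.array([["2023-01-01 00:00:00"], ["2023-06-15 12:30:00"]], dtype=object)

# Default encoder: transform() raises on a category it did not see in fit().
strict = OneHotEncoder()
strict.fit(real)
try:
    strict.transform(synthetic)
    strict_raised = False
except ValueError as err:
    strict_raised = True
    print(err)

# handle_unknown='ignore': the unseen value is encoded as an all-zero row.
lenient = OneHotEncoder(handle_unknown="ignore")
lenient.fit(real)
encoded = lenient.transform(synthetic).toarray()
print(encoded)
```

Note that with handle_unknown='ignore' every unseen timestamp collapses to the same all-zero row, so the error goes away but the encoded column carries little information about the actual dates.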

So I would suggest:

  1. giving the user the option to set this parameter directly in LSTMDetection.compute, or
  2. changing the default OneHotEncoder creation in utils.py, because this issue will probably occur in any dataset with a datetime column.

Ng-ms commented 3 months ago

Hi @Scit3ch, thank you for providing the solution. However, once I apply it I get this error: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''. Do you have any idea how to fix this?

Scit3ch commented 3 months ago

@Ng-ms My guess would be that you have null/NaN ('isnan') values somewhere in your data, but it is really hard to say without more information. As this probably has nothing to do with the bug I'm describing here, maybe update your SDV, SDMetrics and Python versions first and, if the error is still present, open a new issue with more information on how and where this error occurs.
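For context, one common way to get this exact TypeError is calling numpy.isnan on an object-dtype array, for example a column of date strings. Whether that is the cause in the case above is not confirmed; the sketch below only shows the general behavior, with made-up values.

```python
import numpy as np
import pandas as pd

# Object-dtype array mixing a date string and a missing value (illustrative).
values = np.array(["2023-11-12 12:52:00", None], dtype=object)

# np.isnan only supports numeric dtypes, so it fails on object arrays.
try:
    np.isnan(values)
    isnan_failed = False
except TypeError as err:
    isnan_failed = True
    print(err)

# pd.isna handles object arrays and flags None/NaN entries.
mask = pd.isna(values)
print(mask)
```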

srinify commented 3 months ago

Hi @Scit3ch thanks for filing this excellent bug report! I was able to reproduce the issue and I opened a new ticket here for the team to look at: https://github.com/sdv-dev/SDMetrics/issues/584

I will close this issue out for now and mark as Duplicate of #584 -- we can focus our discussion in the new issue.

Can you try the following workaround to see if that resolves the issue?

Suggested Workaround

For now, manually cast your datetime columns to the datetime dtype before using LSTMDetection. One quick way is using pandas.to_datetime():

df['date_col_1'] = pd.to_datetime(df['date_col_1'])

Scit3ch commented 2 months ago

Hi @srinify Thank you for the provided workaround which works.

One more thing: I've noticed that for the 'StatisticSimilarity' metric one also needs to manually cast datetime columns when the data was read via the 'load_csvs' function (maybe because this function reads datetime columns as object columns and the datetime information is only available via the metadata file). As this metric doesn't receive metadata, a meaningful error message would be helpful here too.
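When several columns are affected, the manual cast can be driven by the column metadata. This is a minimal sketch: cast_datetime_columns is a hypothetical helper, not part of SDV or SDMetrics, and it assumes the metadata is available as a plain dict mapping column names to entries like the "date" entry quoted in the original report.

```python
import pandas as pd

def cast_datetime_columns(df, columns_metadata):
    """Return a copy of df with every column marked sdtype='datetime' parsed."""
    result = df.copy()
    for name, info in columns_metadata.items():
        if info.get("sdtype") == "datetime" and name in result.columns:
            # Use the declared format when present; otherwise let pandas infer it.
            result[name] = pd.to_datetime(result[name], format=info.get("datetime_format"))
    return result

# Assumed metadata layout and illustrative data.
columns_metadata = {
    "date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d %H:%M:%S"},
    "value": {"sdtype": "numerical"},
}
df = pd.DataFrame({
    "date": ["2023-11-12 12:52:00", "2024-01-15 02:00:30"],
    "value": [1.0, 2.0],
})

converted = cast_datetime_columns(df, columns_metadata)
print(converted.dtypes)
```

The same helper would be applied to both the real and the synthetic table before passing them to a metric.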

npatki commented 2 months ago

Hi @Scit3ch, thanks for reporting. Most of the metrics in SDMetrics don't receive metadata information and don't do any conversions. They expect the input data to be in the correct format already (e.g. datetimes converted from string to datetime). I agree a meaningful error message would be good there.
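As an illustration of what such an error message could look like on the user side, here is a minimal sketch of a pre-check. The function require_datetime and its message are hypothetical and do not reflect how SDMetrics validates input.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

def require_datetime(series, column_name):
    """Raise a descriptive error if the column is not already a datetime dtype."""
    if not is_datetime64_any_dtype(series):
        raise TypeError(
            f"Column '{column_name}' has dtype {series.dtype}, expected a datetime dtype. "
            "Convert it first, e.g. with pandas.to_datetime()."
        )

# Date strings as they might arrive from a CSV file (illustrative values).
raw = pd.Series(["2023-11-12 12:52:00", "2024-01-15 02:00:30"])

try:
    require_datetime(raw, "date")
    message = None
except TypeError as err:
    message = str(err)
    print(message)

# After conversion the check passes silently.
require_datetime(pd.to_datetime(raw), "date")
```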

Could you file a separate issue for this so we can resolve it separately?

For reference, here are the docs for statistic similarity. We will add clarifications there too.