sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License
209 stars 44 forks

`LSTMDetection` metric crashes when there are multiple context columns #298

Closed Sanchita333 closed 5 months ago

Sanchita333 commented 1 year ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

metadata1 = {
    'fields': {
        'Date': {'type': 'datetime'},
        'Symbol': {'type': 'id'},
        'Open': {'type': 'numerical'},
        'Close': {'type': 'numerical'},
        'Volume': {'type': 'numerical'},
        'Sector': {'type': 'categorical'},
        'Industry': {'type': 'categorical'},
        'MarketCap': {'type': 'numerical'}
    },
    'entity_columns': 'Symbol',
    'sequence_index': 'Date',
    'context_columns': ['MarketCap', 'Sector', 'Industry']
}

I am using the same example that is mentioned in "Time series data generation using PAR models" (https://sdv.dev/SDV/user_guides/timeseries/par.html). I am unable to evaluate the generated synthetic data using the LSTMDetection metric.

[Image: lstmerror (1)]

Steps to reproduce

Paste the command(s) you ran and the output.
If there was a crash, please include the traceback here.

npatki commented 1 year ago

Thanks for filing! I can replicate this on the latest SDMetrics version 0.9.0 and investigated further.

Workarounds

This issue appears to occur only when the dataset has 2 or more context columns. Until we fix this, you will have to remove the additional context columns from the real data, the synthetic data and the metadata so that only one context column remains, as shown below.

import copy

from sdmetrics.timeseries import LSTMDetection

# Remove the 'MarketCap' and 'Sector' context columns
# The only remaining context column will be 'Industry'

# Remove from metadata
metadata_copy = copy.deepcopy(metadata.to_dict())
del metadata_copy['fields']['MarketCap']
del metadata_copy['fields']['Sector']

# Remove from real and synthetic data
real_copy = real_data.drop(['MarketCap', 'Sector'], axis=1)
synthetic_copy = synthetic_data.drop(['MarketCap', 'Sector'], axis=1)

LSTMDetection.compute(
    real_data=real_copy,
    synthetic_data=synthetic_copy,
    metadata=metadata_copy
)
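
The workaround above hard-codes which columns to drop. A minimal sketch of a more general version follows, assuming the metadata is a plain dict laid out like the one in this issue (a 'fields' mapping plus a 'context_columns' list). The helper name `keep_one_context_column` is hypothetical and not part of SDMetrics. Resetting 'context_columns' is only an assumption made to keep the metadata self-consistent; the thread does not say whether the metric reads that key.

```python
import copy


def keep_one_context_column(metadata_dict, keep):
    """Return (trimmed_metadata, dropped_columns), leaving `keep` as the only context column.

    Hypothetical helper, not part of SDMetrics. Assumes the dict layout shown
    in this issue: a 'fields' mapping and a 'context_columns' list.
    """
    context = list(metadata_dict.get('context_columns', []))
    if keep not in context:
        raise ValueError(f'{keep!r} is not a context column: {context}')

    # Every context column except the one we keep has to go.
    dropped = [column for column in context if column != keep]

    # Work on a deep copy so the caller's metadata is left untouched.
    trimmed = copy.deepcopy(metadata_dict)
    for column in dropped:
        trimmed['fields'].pop(column, None)

    # Assumption: keep 'context_columns' in sync with the remaining fields.
    trimmed['context_columns'] = [keep]
    return trimmed, dropped


# Metadata copied from the report above.
metadata = {
    'fields': {
        'Date': {'type': 'datetime'},
        'Symbol': {'type': 'id'},
        'Open': {'type': 'numerical'},
        'Close': {'type': 'numerical'},
        'Volume': {'type': 'numerical'},
        'Sector': {'type': 'categorical'},
        'Industry': {'type': 'categorical'},
        'MarketCap': {'type': 'numerical'},
    },
    'entity_columns': 'Symbol',
    'sequence_index': 'Date',
    'context_columns': ['MarketCap', 'Sector', 'Industry'],
}

trimmed, dropped = keep_one_context_column(metadata, keep='Industry')
print(dropped)  # ['MarketCap', 'Sector']
```

The returned `dropped` list can then be passed to `real_data.drop(columns=dropped)` and `synthetic_data.drop(columns=dropped)` before calling `LSTMDetection.compute` with the trimmed metadata.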

BTW @Sanchita333 I notice your example is using the SDV demo dataset. I am curious if you are planning to apply this to your own (private) dataset. If so, does this dataset have multiple context columns?