sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

HMASynthesizer does not work with demo multitable dataset `Rossmann`, the data does not match the metadata #1777

Closed martinjurkovic closed 6 months ago

martinjurkovic commented 7 months ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

Error Description

Can't fit HMA for Rossmann multitable demo dataset.

Error message:

InvalidDataError: The provided data does not match the metadata:
Table: 'historical'
Error: Invalid values found for datetime column 'Date': ['2013-01-01', '2013-01-02', '2013-01-03', '+ 939 more'].

Table: 'store'
Error: Invalid values found for boolean column 'Promo2': [0, 1].

Steps to reproduce

from sdv.datasets.demo import download_demo
from sdv.multi_table import HMASynthesizer

# Download the multi-table Rossmann demo (tables plus metadata)
dataset_name = 'rossmann'
tables, metadata = download_demo('multi_table', dataset_name, output_folder_name='data/downloads/rossmann')

model = HMASynthesizer(metadata)
model.fit(tables)  # raises InvalidDataError

martinjurkovic commented 7 months ago

The problems are the following:

For the datetime column Date, the format in the metadata is wrong. Right now it is %d/%m/%y but it should be %Y-%m-%d.

For the Promo2 column, the problem is that SingleTableMetadata does not accept numerical values (here 0 and 1) when validating a boolean column. https://github.com/sdv-dev/SDV/blob/74baae90eb64abf52a5ea3e55b2017ef849fec6d/sdv/metadata/single_table.py#L903-L906
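
To illustrate the datetime mismatch outside of SDV, here is a minimal sketch using only the Python standard library. The helper name `find_invalid_datetimes` is hypothetical and is not part of SDV; it only shows that strings like '2013-01-01' cannot be parsed with %d/%m/%y but parse fine with %Y-%m-%d. How SDV performs this check internally may differ.

```python
from datetime import datetime


def find_invalid_datetimes(values, datetime_format):
    """Return the values that cannot be parsed with the given strptime format.

    Hypothetical helper for illustration only; not SDV's implementation.
    """
    invalid = []
    for value in values:
        try:
            datetime.strptime(value, datetime_format)
        except ValueError:
            # The string does not follow the format
            invalid.append(value)
    return invalid


# Sample values taken from the error message above
sample_dates = ['2013-01-01', '2013-01-02', '2013-01-03']

# Format currently stored in the demo metadata
wrong_format_failures = find_invalid_datetimes(sample_dates, '%d/%m/%y')

# Format that matches the data
right_format_failures = find_invalid_datetimes(sample_dates, '%Y-%m-%d')

print(wrong_format_failures)
print(right_format_failures)
```

With the metadata's format every sample value is reported as invalid, which is consistent with the error message for the 'historical' table.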

npatki commented 7 months ago

Hi @martinjurkovic thanks for letting us know and filing this issue. We can keep this open and update the issue once we update the S3 bucket.

FYI, for a quick way to check whether the metadata matches the data, you can use the following command (here data is the dictionary of tables, e.g. tables from download_demo):

metadata.validate_data(data)

Workaround

In the meantime, please feel free to update the invalid columns locally to continue on with this dataset. The following should work:

metadata.update_column(
    table_name='store',
    column_name='Promo2',
    sdtype='categorical'
)

metadata.update_column(
    table_name='historical',
    column_name='Date',
    sdtype='datetime',
    datetime_format='%Y-%m-%d'
)

metadata.validate()
metadata.validate_data(data)

Additional Context

BTW there are a few other datasets that are running into issues due to the metadata. See

npatki commented 6 months ago

Hi @martinjurkovic, this issue has now been fixed. No need to update your SDV version -- it should now work if you re-run download_demo. Let us know if you are still having problems with this. Thanks.