sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

KeyError while sampling using freshly trained PAR model #943

Closed DamianUS closed 2 years ago

DamianUS commented 2 years ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

SDV version: 0.16.0
Python version: 3.8.13 (default, May 8 2022, 17:48:02) [Clang 13.1.6 (clang-1316.0.21.2)]
Operating System: Macbook Pro M1, Mac OS X 12.0.1

Error description

The KeyError is also raised when sampling from a freshly trained PAR model in v0.16.0.

I tried both with and without passing the field types metadata; nothing seems to help.

I printed the model metadata to check whether the model inferred the data types properly, and everything seems correct.

Here is the code I used, in case it helps (this is the latest version, in which the model infers the field types):

import pandas as pd
from sdv.timeseries import PAR
from sdv.metrics.timeseries import TSFClassifierEfficacy

data = pd.read_csv("data/micro_batch_task.csv")
sequence_index = 'start_time'
field_types = {
    "instance_num": {
        "type": "numerical",
        'subtype': 'integer'
    },
    "start_time": {
        "type": "numerical",
        'subtype': 'integer'
    },
    "plan_cpu": {
        "type": "numerical",
        'subtype': 'float'
    },
    "plan_mem": {
        "type": "numerical",
        'subtype': 'float'
    },
    "makespan": {
        "type": "numerical",
        'subtype': 'integer'
    },
}
model = PAR(
    sequence_index=sequence_index,
    segment_size=10,
    epochs=1,
    verbose=True
)
model.fit(data)
print(model.get_metadata().to_dict())
new_data = model.sample(1)
print(new_data)
print(TSFClassifierEfficacy.compute(data, new_data, field_types, target='makespan'))

When trying to sample:

PARModel(epochs=1, sample_size=1, cuda='cpu', verbose=True) instance created
Epoch 1 | Loss 0.001459105173125863: 100%|██████████| 1/1 [00:51<00:00, 51.42s/it]
{'fields': {'instance_num': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'start_time': {'type': 'numerical', 'subtype': 'integer', 'transformer': None}, 'plan_cpu': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'plan_mem': {'type': 'numerical', 'subtype': 'float', 'transformer': None}, 'makespan': {'type': 'numerical', 'subtype': 'integer', 'transformer': None}}, 'constraints': [], 'model_kwargs': {}, 'name': None, 'primary_key': None, 'sequence_index': 'start_time', 'entity_columns': [], 'context_columns': []}
100%|██████████| 1/1 [00:00<00:00, 85.72it/s]
Traceback (most recent call last):
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'start_time'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/damianfernandez/PycharmProjects/sdv/main.py", line 46, in <module>
    new_data = model.sample(1)
  File "/opt/homebrew/lib/python3.8/site-packages/sdv/timeseries/base.py", line 268, in sample
    return self._metadata.reverse_transform(sampled)
  File "/opt/homebrew/lib/python3.8/site-packages/sdv/metadata/table.py", line 700, in reverse_transform
    field_data = reversed_data[name]
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/opt/homebrew/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'start_time'

Process finished with exit code 1

Maybe I'm not doing something properly. I'm new to the library!

yamidibarra commented 2 years ago

Dear @npatki, thank you in advance for your support! I'm having a similar issue, described below:

Environment Details

SDV version: 0.16.0
Python version: 3.8.13
Operating System: Windows 10

Error: Exception has occurred: KeyError 'Time'

The above exception was the direct cause of the following exception: File "C:\Users\Data_Augmentation\PAR_Model.py", line 13, in new_data = model.sample(1)

import pandas as pd
from sdv.timeseries import PAR

data = pd.read_pickle('df_PAR.pkl')
context_columns = ['POM', 'Mold Temperature [°C]', 'Injection velocity [cmm/s]', 'Holding pressure [bar]'] 
entity_columns = ['id']
sequence_index = 'Time'

model = PAR(entity_columns=entity_columns,  context_columns=context_columns,  sequence_index=sequence_index)

model.fit(data)
new_data = model.sample(1)

model.save('Timeseries_synthetic_model.pkl')

Attached you will find the .py file and the .pkl file with the data.

PS: I tried to reproduce the example shown here: https://sdv.dev/SDV/user_guides/timeseries/par.html but I can't access the file. I wanted to check the type of data variables.

yamidibarra commented 2 years ago

https://github.com/sdv-dev/SDV/issues/808#issuecomment-1133123852

I understand what's going on. My Time column is float-type, but PAR only allows a datetime type.
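
A quick way to confirm this before fitting is to inspect the dtype of the sequence index column. The sketch below uses only pandas; the helper name `describe_sequence_index` and the sample frame are hypothetical and not part of SDV.

```python
import pandas as pd
from pandas.api import types as pdt


def describe_sequence_index(data: pd.DataFrame, column: str) -> str:
    """Classify the dtype of a candidate sequence_index column (hypothetical helper)."""
    if column not in data.columns:
        return "missing"
    series = data[column]
    if pdt.is_datetime64_any_dtype(series):
        return "datetime"
    # Booleans count as numeric in pandas, so rule them out first.
    if pdt.is_bool_dtype(series):
        return "other"
    if pdt.is_numeric_dtype(series):
        return "numerical"
    return "other"


# Hypothetical data: a float Time column like the one described above.
example = pd.DataFrame({"id": [1, 1, 1], "Time": [0.0, 0.5, 1.0]})
print(describe_sequence_index(example, "Time"))
```

A result of "numerical" for the sequence index would match the situation described in this thread.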

dharmesh1007 commented 2 years ago

@yamidibarra, I'm having the same issue: the Time column needs to be in datetime format.

npatki commented 2 years ago

Hi everyone,

Yes @yamidibarra, I agree with you. Issue #808 is likely the root cause for all these errors: It is a known issue that the PAR model currently produces a sampling error when sequence_index is numerical (float, int). The error should go away if you express sequence_index as a datetime or if you remove it altogether.

Does this accurately describe everyone's scenario? If so, I can close this issue in favor of #808 for tracking.
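
As a rough illustration of the first option, a numerical sequence index can be mapped to datetimes before fitting and mapped back after sampling. This is a minimal pandas-only sketch under stated assumptions: the frame is made up, the floats are assumed to be elapsed seconds, and the Unix epoch is an arbitrary origin. The PAR fit and sample steps are left out.

```python
import pandas as pd

# Hypothetical frame: "Time" holds elapsed seconds as floats.
data = pd.DataFrame({"id": [1, 1, 1], "Time": [0.0, 0.25, 61.5]})

# Interpret the floats as seconds since an arbitrary origin (the Unix epoch).
as_datetime = data.copy()
as_datetime["Time"] = pd.to_datetime(as_datetime["Time"], unit="s")

# ... fit the model on as_datetime and sample from it here ...

# Convert a datetime column back to elapsed seconds relative to the same origin.
restored = as_datetime.copy()
restored["Time"] = (restored["Time"] - pd.Timestamp(0)).dt.total_seconds()
print(restored)
```

Subtracting the origin and calling `total_seconds()` keeps minutes and hours, so values of 60 seconds or more survive the round trip.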

npatki commented 2 years ago

BTW --

@DamianUS, thanks for filing this issue! I will delete the comments in #935 since you copied it over here

@yamidibarra, re the link:

PS: I tried to reproduce the example shown here: https://sdv.dev/SDV/user_guides/timeseries/par.html but I can't access the file. I wanted to check the type of data variables.

The text of the link is correct, but the hyperlink is pointing to some other URL. You should be able to open the page if you click on this: https://sdv.dev/SDV/user_guides/timeseries/par.html.

yamidibarra commented 2 years ago

Yes, it resolves this specific issue. Here is my workaround. I'll open another issue regarding the synthetic data; I have some questions and would appreciate your opinion, dear @npatki.

import pandas as pd
from sdv.timeseries import PAR

data = pd.read_pickle('df_PAR.pkl')
# Time is assumed to be in seconds: scale to nanoseconds so that
# pd.to_datetime reads each value as an offset from the epoch
data['Time'] = data['Time'].multiply(1E9)
data['Time'] = pd.to_datetime(data['Time'])

context_columns = ['POM', 'Mold Temperature [°C]', 'Injection velocity [cmm/s]', 'Holding pressure [bar]']
entity_columns = ['id']
sequence_index = 'Time'
model = PAR(entity_columns=entity_columns, context_columns=context_columns, sequence_index=sequence_index)

model.fit(data)
new_data = model.sample(1)

# get seconds back as a float (uses only the second and microsecond fields of each timestamp)
new_data['Time'] = new_data['Time'].apply(lambda x: '%02d.%06d' % (x.second, x.microsecond)).astype(float)

npatki commented 2 years ago

Great, thanks for confirming! I'll close this issue in favor of #808.

Please feel free to reply if you continue to see a KeyError on the PAR model even if you have a datetime sequence_index and I can reopen this issue for discussion.

mohammedsabiya commented 1 year ago

Hi, I am facing the same KeyError in PARSynthesizer, even though sequence_index is a datetime. Please see issue #1510.

p.s. the KeyError that I get is from the context_columns

npatki commented 1 year ago

@mohammedsabiya Thanks for filing! We'll follow up in the new issue, as it's been some time since this original one was resolved.