synthesized-io / community

https://synthesized.io

Pandas DataFrame index preserving generation #3

Closed: aplotnikov2020 closed this issue 1 year ago

aplotnikov2020 commented 1 year ago

Currently, HighDimSynthesizer does not preserve the pandas DataFrame index. Consider the following example:

import pandas as pd
from datetime import datetime

from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.model import DataFrameModel

data = [
    {'day': datetime(2022, 10, 1), 'country': 'NL', 'y': 1.0},
    {'day': datetime(2022, 10, 2), 'country': 'NL', 'y': 2.0},
    {'day': datetime(2022, 10, 3), 'country': 'NL', 'y': 3.0},
    {'day': datetime(2022, 10, 1), 'country': 'ES', 'y': 10.0},
    {'day': datetime(2022, 10, 2), 'country': 'ES', 'y': 20.0},
    {'day': datetime(2022, 10, 3), 'country': 'ES', 'y': 30.0},
]

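# Build a DataFrame whose (day, country) pairs form a MultiIndex
df = pd.DataFrame.from_records(data, index=['day', 'country'])

print(df)

# Extract metadata and fit the data model
df_meta = MetaExtractor.extract(df)
DataFrameModel(df_meta).fit(df)

# Train the synthesizer, then generate as many rows as the original has
synth = HighDimSynthesizer(df_meta)
synth.learn(df_train=df)

df_synth = synth.synthesize(num_rows=len(df))
print(df_synth)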

Original data:

                       y
day        country      
2022-10-01 NL        1.0
2022-10-02 NL        2.0
2022-10-03 NL        3.0
2022-10-01 ES       10.0
2022-10-02 ES       20.0
2022-10-03 ES       30.0

Synthesized data:

    y
0  30
1   2
2   2
3   3
4  10
5  30

Desired behavior: I'd expect synthesized to produce the index values along with the y values.

Workarounds: If the synthesized data has the same number of rows as the original, we can concatenate the original index columns with the generated DataFrame:

df_recovered = pd.concat([df.reset_index().drop(columns='y'), df_synth], axis=1)

Possible caveats: There might be an issue when the index values are not unique.
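The workaround can be tried end to end without the SDK. The sketch below is only an illustration: df_synth is a hand-made stand-in whose values are copied from the synthesized output shown in this issue, and attach_original_index is a hypothetical helper, not part of synthesized. It assigns the original index by position, which requires equal row counts but does not depend on the index labels being unique.

```python
import pandas as pd
from datetime import datetime

data = [
    {'day': datetime(2022, 10, 1), 'country': 'NL', 'y': 1.0},
    {'day': datetime(2022, 10, 2), 'country': 'NL', 'y': 2.0},
    {'day': datetime(2022, 10, 3), 'country': 'NL', 'y': 3.0},
    {'day': datetime(2022, 10, 1), 'country': 'ES', 'y': 10.0},
    {'day': datetime(2022, 10, 2), 'country': 'ES', 'y': 20.0},
    {'day': datetime(2022, 10, 3), 'country': 'ES', 'y': 30.0},
]
df = pd.DataFrame.from_records(data, index=['day', 'country'])

# Stand-in for synth.synthesize(num_rows=len(df)); values copied from
# the synthesized output in this issue, not produced by the SDK here.
df_synth = pd.DataFrame({'y': [30, 2, 2, 3, 10, 30]})


def attach_original_index(df_original, df_generated):
    """Hypothetical helper: reuse the original index on generated rows.

    Rows are paired purely by position, so both frames must have the
    same length.
    """
    if len(df_original) != len(df_generated):
        raise ValueError("row counts differ; cannot reuse the index positionally")
    out = df_generated.reset_index(drop=True)
    out.index = df_original.index  # positional assignment, no label alignment
    return out


df_recovered = attach_original_index(df, df_synth)
print(df_recovered)
```

Note that the generated y values are paired with index entries arbitrarily: nothing in this sketch ties a synthetic row to a particular day or country.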

nialldevlin1 commented 1 year ago

Hi, thanks for reporting this!

Currently, the HighDimSynthesizer does not consider index values, so index preservation is not supported. If the index values are required, we recommend using the reset_index() method to move them into regular columns before synthesis. We may support this feature in a future release if there is enough interest in it.
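A minimal sketch of the reset_index() suggestion, assuming every index level is named. The helper synthesize_with_index_columns and the stand-in resample_rows are hypothetical names introduced for illustration; resample_rows simply resamples existing rows so the example runs without the SDK, and it is not how HighDimSynthesizer generates data.

```python
import pandas as pd
from datetime import datetime


def synthesize_with_index_columns(df, synthesize_fn):
    """Hypothetical helper: move the index into columns, synthesize,
    then rebuild an index from the generated columns."""
    index_names = list(df.index.names)
    if any(name is None for name in index_names):
        raise ValueError("all index levels must be named")
    flat = df.reset_index()           # index levels become ordinary columns
    flat_synth = synthesize_fn(flat)  # generator now sees day/country too
    return flat_synth.set_index(index_names)


def resample_rows(flat):
    # Stand-in generator (assumption, NOT the SDK): sample rows with
    # replacement. With the SDK, this function would wrap the
    # MetaExtractor / HighDimSynthesizer calls shown earlier in the issue.
    sampled = flat.sample(n=len(flat), replace=True, random_state=0)
    return sampled.reset_index(drop=True)


data = [
    {'day': datetime(2022, 10, 1), 'country': 'NL', 'y': 1.0},
    {'day': datetime(2022, 10, 2), 'country': 'NL', 'y': 2.0},
    {'day': datetime(2022, 10, 3), 'country': 'NL', 'y': 3.0},
    {'day': datetime(2022, 10, 1), 'country': 'ES', 'y': 10.0},
    {'day': datetime(2022, 10, 2), 'country': 'ES', 'y': 20.0},
    {'day': datetime(2022, 10, 3), 'country': 'ES', 'y': 30.0},
]
df = pd.DataFrame.from_records(data, index=['day', 'country'])

result = synthesize_with_index_columns(df, resample_rows)
print(result)
```

Because the index columns are generated like any other column, the rebuilt index may contain repeated (day, country) pairs; result.index.is_unique can be checked if uniqueness matters.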

If you're interested in time-series applications, have a look at our (new!) regular and event-based time-series synthesizers: https://docs.synthesized.io/sdk/latest/user_guide/time_series_synthesis/. These are currently in beta and only available in the paid version.

Thanks again for sharing this! In the meantime, we will close this issue as 'Not planned'.