synthesized-io / community

https://synthesized.io

HighDimSynthesizer throws "Probabilities do not sum to 1" [BUG] #2

Closed: aplotnikov2020 closed this issue 1 year ago

aplotnikov2020 commented 1 year ago

Describe the bug

I receive the following error when running a HighDimSynthesizer (the traceback points to the synthesize() call rather than to training):

Traceback (most recent call last):
  File "compare_synthetic_vs_actual/simple.py", line 23, in <module>
    df_synth = synth.synthesize(num_rows=len(df))
  File "synthesized/_licence/analytics.py", line 97, in synthesized._licence.analytics.track._track.wrapper
  File "synthesized/complex/highdim.py", line 452, in synthesized.complex.highdim.HighDimSynthesizer.synthesize
  File "synthesized/model/data_frame_model.py", line 56, in synthesized.model.data_frame_model.DataFrameModel.sample
  File "synthesized/model/models/histogram.py", line 161, in synthesized.model.models.histogram.Histogram.sample
  File "mtrand.pyx", line 939, in numpy.random.mtrand.RandomState.choice
ValueError: probabilities do not sum to 1
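
For context, the last frame of the traceback is NumPy's RandomState.choice, which raises this ValueError whenever the p argument does not sum to 1 within a small tolerance. The sketch below reproduces the message in isolation. The labels and weights are made up for illustration, and it makes no claim about how Histogram.sample actually computes its probabilities.

```python
import numpy as np

rng = np.random.RandomState(0)
categories = ["2022-10-01", "2022-10-02", "2022-10-03"]

# Hypothetical weights that sum to 0.9 rather than 1.
bad_probs = np.array([0.3, 0.3, 0.3])

try:
    rng.choice(categories, size=6, p=bad_probs)
    error_message = None
except ValueError as exc:
    # NumPy rejects probability vectors that do not sum to 1.
    error_message = str(exc)

# Normalising the weights lets the same call succeed.
good_probs = bad_probs / bad_probs.sum()
sample = rng.choice(categories, size=6, p=good_probs)
```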

To Reproduce

The code below reproduces the error:

import pandas as pd

from synthesized import MetaExtractor, HighDimSynthesizer
from synthesized.model import DataFrameModel

data = [
    {'day': '2022-10-01', 'country': 'NL', 'y': 1.0},
    {'day': '2022-10-02', 'country': 'NL', 'y': 2.0},
    {'day': '2022-10-03', 'country': 'NL', 'y': 3.0},
    {'day': '2022-10-01', 'country': 'ES', 'y': 10.0},
    {'day': '2022-10-02', 'country': 'ES', 'y': 20.0},
    {'day': '2022-10-03', 'country': 'ES', 'y': 30.0},
]

df = pd.DataFrame.from_records(data)

df_meta = MetaExtractor.extract(df)
DataFrameModel(df_meta).fit(df)

synth = HighDimSynthesizer(df_meta)
synth.learn(df_train=df)

df_synth = synth.synthesize(num_rows=len(df))
print(df_synth)

The problem goes away after casting the day column to a datetime type:

df['day'] = pd.to_datetime(df['day'])

However, using types other than datetime (I tried date) also causes the error.

nialldevlin1 commented 1 year ago

Hi, thanks for submitting this report!

We've identified the cause of this issue and will implement a fix in version 2.2 of the SDK. Until then, we recommend casting the dates from string to datetime before calling MetaExtractor.extract(), as you have done, i.e.:

df["day"] = pd.to_datetime(df["day"])
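
If several columns hold date strings, the same cast can be wrapped in a small helper. This is only a sketch using pandas: cast_to_datetime, its columns argument, and the default format are hypothetical names chosen for illustration and are not part of the SDK.

```python
import pandas as pd


def cast_to_datetime(df, columns, fmt="%Y-%m-%d"):
    """Return a copy of df with the named string columns parsed as datetimes."""
    out = df.copy()
    for col in columns:
        # The default errors="raise" surfaces unparseable values early.
        out[col] = pd.to_datetime(out[col], format=fmt)
    return out


df = pd.DataFrame.from_records([
    {"day": "2022-10-01", "country": "NL", "y": 1.0},
    {"day": "2022-10-02", "country": "ES", "y": 20.0},
])
df = cast_to_datetime(df, ["day"])
```

The converted frame can then be passed to MetaExtractor.extract(df) as in the original report.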

Thank you again!

edit: changed 2.3 to 2.2

nialldevlin1 commented 1 year ago

This fix has now been implemented in v2.2, available now on PyPI. Thanks again for raising this issue with us!