sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

PAR Diagnostic is not 1.0 for datetime context columns #2018

Open npatki opened 1 month ago

npatki commented 1 month ago

Error Description

As originally described by @Ng-ms in #2004: when a datetime column is used as a context column, the min/max values of the synthesized data fall outside the range observed in the real data. This causes the BoundaryAdherence score to be <1.0 for that context column.
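To make the symptom concrete, here is a minimal sketch of the kind of check a boundary-adherence score performs: the fraction of synthetic values that fall inside the real column's observed [min, max] range. It is an approximation for illustration only, not the SDMetrics implementation. The helper name `boundary_adherence` and the sample dates are hypothetical.

```python
from datetime import datetime


def boundary_adherence(real_values, synthetic_values):
    """Fraction of non-missing synthetic values inside the real [min, max] range.

    Illustrative approximation only; not the library's own code.
    """
    real = [v for v in real_values if v is not None]
    synthetic = [v for v in synthetic_values if v is not None]
    if not real or not synthetic:
        raise ValueError("both columns need at least one non-missing value")
    low, high = min(real), max(real)
    in_range = sum(1 for v in synthetic if low <= v <= high)
    return in_range / len(synthetic)


# Hypothetical values standing in for a datetime context column such as pre_date.
real_pre_date = [datetime(2020, 1, 1), datetime(2020, 6, 15), datetime(2021, 3, 1)]
synthetic_pre_date = [
    datetime(2020, 2, 1),    # inside the real range
    datetime(2021, 2, 28),   # inside the real range
    datetime(2019, 12, 31),  # one day before the real minimum
    datetime(2021, 3, 2),    # one day after the real maximum
]

score = boundary_adherence(real_pre_date, synthetic_pre_date)
print(score)  # 0.5
```

Under this definition, a single synthetic value outside the real range is enough to push the score below 1.0, which matches what is reported for pre_date.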

Steps to reproduce

Note that the dataset is not available for privacy reasons. The SDV team will try to replicate this with SDV demo data.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sdv.sequential import PARSynthesizer

# df, numeric_columns, date_columns and metadata are not shown in the report.
min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)
metadata.set_sequence_index(column_name='visit_date')
synthesizer = PARSynthesizer(
    metadata, epochs=1000, context_columns=['pre_date', 'sex', 'Cod'],  # 'sex' was unquoted in the report
    verbose=True, enforce_min_max_values=True, enforce_rounding=True, cuda=True)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)

Diagnostic score output: (image not included)

For this issue let's just focus on the fact that context column pre_date has a score <1.0. There is a separate issue for the sequence index visit_date.

srinify commented 1 month ago

I'm not able to reproduce this issue using our demo datasets (or even using randomly generated data).

I'll leave this issue open in case someone can share code that helps us reproduce it! @ng-ms