sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Repeated sequence_index values in specific situations #2004

Closed Ng-ms closed 1 month ago

Ng-ms commented 1 month ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

After applying the workaround in #1973, the sequence_index is not NaN anymore, but I noticed repeated sequence index values (in my dataset this is the date of the visit). These dates should be unique, because each one corresponds to a unique visit.

[Screenshot from 2024-05-14 14-42-07]

npatki commented 1 month ago

Thanks for filing @Ng-ms. Adding a link to our previous conversation.

As with all these issues, it is very beneficial to us if you can provide some data to help us replicate it. Otherwise, it may take us longer to find the root cause.

Summary

Next Steps

The SDV team will investigate the cause of this issue. Internally, I would want to check for two things:

  1. We compute the differences in the sequence index columns, and then later synthesize them. When we do this, we enforce min/max values -- meaning that we should never synthesize any interval that is <min or >max, as observed from the real data. We should verify this is working.
  2. Perhaps what we're seeing could be a rounding error? E.g., instead of synthesizing +1 day, we might be synthesizing +23 hr 59 min, which is then being rounded down for some reason.
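Hypothesis 2 can be sketched with a toy example (plain pandas, not SDV's internals): if a synthesized interval falls just short of a full day and timestamps are later floored to calendar dates, two consecutive rows collapse onto the same date.

```python
import pandas as pd

# Toy illustration of the rounding hypothesis (not SDV's actual code):
# an interval synthesized as +23h 59min instead of +1 day collapses onto
# the same calendar date once timestamps are floored to day precision.
start = pd.Timestamp('2024-05-14')
next_ts = start + pd.Timedelta(hours=23, minutes=59)  # intended: +1 day

rounded = [start.normalize(), next_ts.normalize()]  # floor both to the day
# rounded[0] == rounded[1] -> a repeated sequence_index value
```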
Ng-ms commented 1 month ago

Thank you, @npatki. Unfortunately, I am not able to share the data, since it belongs to a hospital.

```python
min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)

metadata.set_sequence_index(column_name='visit_date')
synthesizer = PARSynthesizer(
    metadata,
    epochs=1000,
    context_columns=['pre_date', 'sex', 'Cod'],
    verbose=True,
    enforce_min_max_values=True,
    enforce_rounding=True,
    cuda=True,
)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)
```

This is the code I am using right now. In the image below you can see the number of duplicated "visit_date" values in the synthetic data:

[image]

while here in the real data there are no duplicates at all: `Series([], dtype: int64)`.
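For reference, a duplicate check along these lines (a hypothetical minimal reconstruction, since the actual snippet isn't shown in the thread) produces that empty Series when every date is unique:

```python
import pandas as pd

# Hypothetical reconstruction of the duplicate check: count occurrences of
# each visit_date and keep only the dates that appear more than once.
real = pd.DataFrame({
    'visit_date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
})
counts = real['visit_date'].value_counts()
duplicates = counts[counts > 1]
# with all-unique dates, `duplicates` is an empty Series
```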

Ng-ms commented 1 month ago

I just want to add the Diagnostic Report results, maybe it will be helpful:

|   | Property       | Score    |
|---|----------------|----------|
| 0 | Data Validity  | 0.973778 |
| 1 | Data Structure | 1.000000 |

[Screenshot from 2024-05-16 16-28-39]

npatki commented 1 month ago

Hi @Ng-ms, thanks. We will try to replicate it, but just be aware that it may take us some time since we don't have a dataset to work with.

Thank you for running the diagnostic. These scores are supposed to be 1.0. Just to confirm in this example: pre_date is a context column and visit_date is the sequence index? We will focus this issue on repeated values in the sequence index, but I will also file a new issue for why pre_date, a context column, is out of bounds (i.e., has a lower BoundaryAdherence score).

Ng-ms commented 1 month ago

hello, yes pre_date is a context column and visit_date is a sequence index

srinify commented 1 month ago

Hi there @Ng-ms, I tried to replicate this with our demo datasets and wasn't able to yet, unfortunately.

If you're able to meet us halfway by trying to modify one of our demo datasets (deleting rows, adding context columns, formatting date time values in a similar way to your dataset, etc) to force this error to occur, that would be a massive help for us.

The following code snippet can get you started with a demo dataset that has a datetime column:

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019'
)

Either way, we'll keep this issue open to see if others have run into it and can help us reproduce this.

Scit3ch commented 1 month ago

I can confirm the issue @Ng-ms described, as I have the same problems with my dataset.

Environment: SDV version: 1.12.1 Python version: 3.12 Operating System: Windows

I'm also not able to share the data as it's confidential.

However, I can provide some more information and a guess what's causing the issue.

The dataset I use has the following structure and properties:

The synthesizer is trained via (more epochs were also tested, but they don't change the issue):

```python
synthesizer = PARSynthesizer(
    metadata,
    enforce_min_max_values=True,
    enforce_rounding=False,
    epochs=100,
    context_columns=['ContextColumn1', 'ContextColumn2', 'ContextColumn3'],
    verbose=True,
)
```

The synthetic data is generated via:

```python
synthetic_data = synthesizer.sample(
    num_sequences=20,
    sequence_length=None,
)
```

What I've noticed is the following:

I guess the problem lies in the way the synthetic data sequences are generated and here I have to guess how the process probably works, because I haven't looked into the code details:

If my guess is correct I would suggest an option to exclude the sequence_index column from the enforce_min_max_values option. Otherwise, the algorithm has to plan ahead when starting a sequence if it will reach the max. limit and need to adapt e.g. the used gap based on the number of entries which will be generated in the sequence.
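The guessed mechanism can be illustrated in a few lines of NumPy (this is an assumption about the process, not SDV's actual implementation): once the cumulative sum of synthesized gaps is clipped at the observed maximum, every subsequent value is clamped to that maximum and the sequence index repeats.

```python
import numpy as np

# Assumed mechanism, not SDV's actual code: gaps are cumulatively summed
# from the sequence start and then clipped to the max observed in the real
# data. After the running total hits the max, later values all clamp to it.
observed_max = 10.0
gaps = np.array([3.0, 4.0, 5.0, 2.0])            # synthesized gap values
sequence = np.clip(np.cumsum(gaps), None, observed_max)
# sequence -> [ 3.,  7., 10., 10.]  (the last two entries are duplicates)
```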

srinify commented 1 month ago

@Scit3ch very thorough & excellent analysis! You seem to be right about this, and I was able to replicate it on a much simpler dataset by requesting a sequence_length that exceeds the number of unique values in a single sequence of the original dataset.

In simpler terms, trying to synthesize a sequence of 7 rows where one of the original sequences only had 5 rows caused SDV to generate duplicates and essentially "run out" of dates to generate because of enforce_min_max_values being True.
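The counting argument behind this replication can be sketched directly (toy numbers, not the actual experiment): daily timestamps confined to the real sequence's [min, max] range offer only as many distinct dates as the range spans, so a longer synthetic sequence must repeat.

```python
import pandas as pd

# Pigeonhole sketch of the replication: a real sequence spanning 5 daily
# timestamps offers only 5 distinct dates inside its [min, max] range, so
# any 7-row synthetic sequence kept in that range must contain duplicates.
real_dates = pd.date_range('2024-01-01', periods=5, freq='D')
distinct_days = (real_dates.max() - real_dates.min()).days + 1  # -> 5
requested_rows = 7
must_contain_duplicates = requested_rows > distinct_days
```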

Context columns don't seem to matter here. Thanks for all the help! I will close out this issue so we can track the work and discussion better in the bug report I've opened: https://github.com/sdv-dev/SDV/issues/2031

Some short term workarounds with tradeoffs: