sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Repeated sequence_index values in specific situations #2004

Closed Ng-ms closed 1 month ago

Ng-ms commented 1 month ago

Environment Details

Please indicate the following details about the environment in which you found the bug:

After applying the workaround in #1973, the sequence_index is not NaN anymore, but I noticed repeated sequence index values (in my dataset this is the date of the visit). These dates should be unique, because each one corresponds to a unique visit.

[Screenshot from 2024-05-14 14-42-07]

npatki commented 1 month ago

Thanks for filing @Ng-ms. Adding a link to our previous conversation.

As with all these issues, it is very beneficial to us if you can provide some data to help us replicate it. Otherwise, it may take us longer to find the root cause.

Summary

Next Steps

The SDV team will investigate the cause of this issue. Internally, I would want to check for two things:

  1. We compute the differences in the sequence index columns, and then later synthesize them. When we do this, we enforce min/max values -- meaning that we should never synthesize any interval that is <min or >max, as observed from the real data. We should verify this is working.
  2. Perhaps what we're seeing could be a rounding error? E.g., instead of synthesizing +1 day, we might be synthesizing +23 hr 59 min, which is then being rounded down for some reason.
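Hypothesis 2 can be sketched with a toy example (plain pandas, not SDV's internals): if a synthesized interval falls just short of a full day and timestamps are later floored to calendar dates, two consecutive rows collapse onto the same date.

```python
import pandas as pd

# Toy illustration of the rounding hypothesis (not SDV's actual code):
# an interval synthesized as +23h 59min instead of +1 day collapses onto
# the same calendar date once timestamps are floored to day precision.
start = pd.Timestamp('2024-05-14')
next_ts = start + pd.Timedelta(hours=23, minutes=59)  # intended: +1 day

rounded = [start.normalize(), next_ts.normalize()]  # floor both to the day
# rounded[0] == rounded[1] -> a repeated sequence_index value
```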
Ng-ms commented 1 month ago

Thank you, @npatki. Unfortunately, I am not able to share the data, since it belongs to a hospital.

```python
min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)

metadata.set_sequence_index(column_name='visit_date')
synthesizer = PARSynthesizer(
    metadata,
    epochs=1000,
    context_columns=['pre_date', 'sex', 'Cod'],
    verbose=True,
    enforce_min_max_values=True,
    enforce_rounding=True,
    cuda=True,
)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)
```

This is the code I am using right now. In the image below you can see the number of duplicated "visit_date" values in the synthetic data:

[image]

while here in the real data there are no duplicates at all: `Series([], dtype: int64)`.
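For reference, a duplicate check along these lines (a hypothetical minimal reconstruction, since the actual snippet isn't shown in the thread) produces that empty Series when every date is unique:

```python
import pandas as pd

# Hypothetical reconstruction of the duplicate check: count occurrences of
# each visit_date and keep only the dates that appear more than once.
real = pd.DataFrame({
    'visit_date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
})
counts = real['visit_date'].value_counts()
duplicates = counts[counts > 1]
# with all-unique dates, `duplicates` is an empty Series
```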

Ng-ms commented 1 month ago

I just want to add the Diagnostic Report results, maybe it will be helpful:

|   | Property       | Score    |
|---|----------------|----------|
| 0 | Data Validity  | 0.973778 |
| 1 | Data Structure | 1.000000 |

[Screenshot from 2024-05-16 16-28-39]

npatki commented 1 month ago

Hi @Ng-ms, thanks. We will try to replicate it, but just be aware that it may take us some time since we don't have a dataset to work with.

Thank you for running the diagnostic. These scores are supposed to be 1.0. Just to confirm in this example: pre_date is a context column and visit_date is the sequence index? We will focus this issue on repeated values in the sequence index, but I will also file a new issue for why pre_date, a context column, is out of bounds (i.e., has a lower BoundaryAdherence score).

Ng-ms commented 1 month ago

hello, yes pre_date is a context column and visit_date is a sequence index

srinify commented 1 month ago

Hi there @Ng-ms, I tried to replicate this with our demo datasets and wasn't able to yet, unfortunately.

If you're able to meet us halfway by trying to modify one of our demo datasets (deleting rows, adding context columns, formatting date time values in a similar way to your dataset, etc) to force this error to occur, that would be a massive help for us.

The following code snippet can get you started with a demo dataset that has a datetime column:

from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019'
)

Either way, we'll keep this issue open to see if others have run into it and can help us reproduce this.

Scit3ch commented 1 month ago

I can confirm the issue @Ng-ms described, as I have the same problems with my dataset.

Environment: SDV version: 1.12.1 Python version: 3.12 Operating System: Windows

I'm also not able to share the data as it's confidential.

However, I can provide some more information and a guess what's causing the issue.

The dataset I use has the following structure and properties:

The synthesizer is trained via (more epochs were also tested, but they don't change the issue):

```python
synthesizer = PARSynthesizer(
    metadata,
    enforce_min_max_values=True,
    enforce_rounding=False,
    epochs=100,
    context_columns=['ContextColumn1', 'ContextColumn2', 'ContextColumn3'],
    verbose=True,
)
```

The synthetic data is generated via:

```python
synthetic_data = synthesizer.sample(
    num_sequences=20,
    sequence_length=None,
)
```

What I've noticed is the following:

I guess the problem lies in the way the synthetic data sequences are generated and here I have to guess how the process probably works, because I haven't looked into the code details:

If my guess is correct I would suggest an option to exclude the sequence_index column from the enforce_min_max_values option. Otherwise, the algorithm has to plan ahead when starting a sequence if it will reach the max. limit and need to adapt e.g. the used gap based on the number of entries which will be generated in the sequence.
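The guessed mechanism can be illustrated in a few lines of NumPy (this is an assumption about the process, not SDV's actual implementation): once the cumulative sum of synthesized gaps is clipped at the observed maximum, every subsequent value is clamped to that maximum and the sequence index repeats.

```python
import numpy as np

# Assumed mechanism, not SDV's actual code: gaps are cumulatively summed
# from the sequence start and then clipped to the max observed in the real
# data. After the running total hits the max, later values all clamp to it.
observed_max = 10.0
gaps = np.array([3.0, 4.0, 5.0, 2.0])            # synthesized gap values
sequence = np.clip(np.cumsum(gaps), None, observed_max)
# sequence -> [ 3.,  7., 10., 10.]  (the last two entries are duplicates)
```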

srinify commented 1 month ago

@Scit3ch very thorough & excellent analysis! You seem to be right about this, and I was able to replicate it on a much simpler dataset by requesting a sequence_length that exceeds the number of unique values in a single sequence of the original dataset.

In simpler terms, trying to synthesize a sequence of 7 rows where one of the original sequences only had 5 rows caused SDV to generate duplicates and essentially "run out" of dates to generate because of enforce_min_max_values being True.
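The counting argument behind this replication can be sketched directly (toy numbers, not the actual experiment): daily timestamps confined to the real sequence's [min, max] range offer only as many distinct dates as the range spans, so a longer synthetic sequence must repeat.

```python
import pandas as pd

# Pigeonhole sketch of the replication: a real sequence spanning 5 daily
# timestamps offers only 5 distinct dates inside its [min, max] range, so
# any 7-row synthetic sequence kept in that range must contain duplicates.
real_dates = pd.date_range('2024-01-01', periods=5, freq='D')
distinct_days = (real_dates.max() - real_dates.min()).days + 1  # -> 5
requested_rows = 7
must_contain_duplicates = requested_rows > distinct_days
```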

Context columns don't seem to matter here. Thanks for all the help! I will close out this issue so we can track the work and discussion better in the bug report I've opened: https://github.com/sdv-dev/SDV/issues/2031

Some short term workarounds with tradeoffs: