Closed: @Ng-ms closed this issue 1 month ago.
Thanks for filing @Ng-ms. Adding a link to our previous conversation.
As with all these issues, it is very beneficial to us if you can provide some data to help us replicate it. Otherwise, it may take us longer to find the root cause.
The SDV team will investigate the cause of this issue. Internally, I would want to check for two things:
values that are < min or > max
, as observed from the real data. We should verify this is working.

Thank you, @npatki. Unfortunately, I am not able to share the data since it belongs to a hospital.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sdv.sequential import PARSynthesizer

min_max_scaler = MinMaxScaler()
df[numeric_columns] = min_max_scaler.fit_transform(df[numeric_columns])
df[date_columns] = df[date_columns].apply(pd.to_datetime, format='%d/%m/%Y', errors='coerce')
df['pre_date'] = pd.to_datetime(df['pre_date'], unit='ns').astype(int)  # datetime -> int nanoseconds
metadata.set_sequence_index(column_name='visit_date')
synthesizer = PARSynthesizer(metadata, epochs=1000,
                             context_columns=['pre_date', 'sex', 'Cod'],
                             verbose=True, enforce_min_max_values=True,
                             enforce_rounding=True, cuda=True)
synthesizer.fit(df)
synthetic_data = synthesizer.sample(num_sequences=4000, sequence_length=None)
This is the code I am using right now.
In the image below you can see the number of duplicated "visit_date" values in the synthetic data, while the same check on the real data returns no duplicates: Series([], dtype: int64).
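For anyone trying to reproduce that check, the duplicate counts can be obtained with a short pandas snippet. A minimal sketch: the `patient_id` sequence key and the toy values are assumptions for illustration; only `visit_date` comes from this thread.

```python
import pandas as pd

# Toy synthetic data with a repeated sequence index (visit_date).
synthetic_data = pd.DataFrame({
    'patient_id': [1, 1, 1, 2, 2],
    'visit_date': ['2020-01-01', '2020-01-01', '2020-01-02',
                   '2020-03-05', '2020-03-05'],
})

# Count how often each visit_date occurs within each sequence, then keep
# only repeats. An empty result (Series([], dtype: int64)) means no duplicates.
dupes = synthetic_data.groupby('patient_id')['visit_date'].value_counts()
dupes = dupes[dupes > 1]
print(dupes)
```

Running the same snippet on the real data should print the empty `Series([], dtype: int64)` mentioned above.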
I just want to add the Diagnostic Report results, in case they are helpful:

Property          Score
Data Validity     0.973778
Data Structure    1.000000
Hi @Ng-ms thanks. We will try to replicate it but just be aware that it may take us some time since we don't have a dataset to be working with.
Thank you for running the diagnostic. These scores are supposed to be 1.0. Just to confirm in this example: pre_date is a context column and visit_date is a sequence index? We will focus this issue on the repeated values in the sequence index, but I will also file a new issue for why pre_date, a context column, is out of bounds (i.e. has a lower BoundaryAdherence score).
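For context, BoundaryAdherence measures the fraction of synthetic values that fall inside the real data's [min, max] range. A hand-rolled sketch of that idea (not the actual SDMetrics implementation; the toy values are made up):

```python
import pandas as pd

real = pd.Series([10.0, 12.5, 15.0])
synthetic = pd.Series([9.0, 11.0, 16.0, 14.0])

# Fraction of synthetic values within the real [min, max] range;
# 1.0 would mean full boundary adherence.
in_bounds = synthetic.between(real.min(), real.max())
score = in_bounds.mean()
print(score)  # 0.5: two of four values (9.0 and 16.0) are out of bounds
```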
Hello, yes: pre_date is a context column and visit_date is a sequence index.
Hi there @Ng-ms, I tried to replicate this with our demo datasets but wasn't able to yet, unfortunately.
If you're able to meet us halfway by trying to modify one of our demo datasets (deleting rows, adding context columns, formatting date time values in a similar way to your dataset, etc) to force this error to occur, that would be a massive help for us.
The following code snippet can get you started with a demo dataset that has a datetime column:
from sdv.datasets.demo import download_demo

real_data, metadata = download_demo(
    modality='sequential',
    dataset_name='nasdaq100_2019'
)
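To sketch the kind of modifications suggested above (deleting rows, reformatting datetime values) without the actual download, here is a minimal pandas example on a stand-in DataFrame. The Symbol/Date/Close column names are assumptions for illustration, not the verified nasdaq100_2019 schema.

```python
import pandas as pd

# Stand-in sequential dataset: 'Symbol' as sequence key, 'Date' as sequence index.
real_data = pd.DataFrame({
    'Symbol': ['AAPL'] * 5 + ['MSFT'] * 5,
    'Date': pd.date_range('2019-01-01', periods=5).tolist() * 2,
    'Close': range(10),
})

# Truncate every sequence to its first 3 rows to create short sequences.
truncated = real_data.groupby('Symbol').head(3)

# Reformat the datetime column to the %d/%m/%Y style used in this thread.
truncated = truncated.assign(Date=truncated['Date'].dt.strftime('%d/%m/%Y'))
print(truncated)
```

The same transformations applied to the real demo download could then be fed to PARSynthesizer to try to force the error.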
Either way, we'll keep this issue open to see if others have run into it and can help us reproduce this.
I can confirm the issue @Ng-ms described, as I have the same problems with my dataset.
Environment: SDV version: 1.12.1 Python version: 3.12 Operating System: Windows
I'm also not able to share the data as it's confidential.
However, I can provide some more information and a guess at what is causing the issue.
The dataset I use has the following structure and properties:
The synthesizer is trained via (more epochs were also tested but do not change the issue):
synthesizer = PARSynthesizer(
    metadata,
    enforce_min_max_values=True,
    enforce_rounding=False,
    epochs=100,
    context_columns=['ContextColumn1', 'ContextColumn2', 'ContextColumn3'],
    verbose=True
)
The synthetic data is generated via:
synthetic_data = synthesizer.sample(
    num_sequences=20,
    sequence_length=None
)
What I've noticed is the following:
I guess the problem lies in the way the synthetic data sequences are generated. Here I have to guess how the process probably works, because I haven't looked into the code details:
If my guess is correct, I would suggest an option to exclude the sequence_index column from the enforce_min_max_values option. Otherwise, the algorithm would have to plan ahead when starting a sequence: if it is going to reach the max limit, it would need to adapt, e.g., the gap size, based on the number of entries that will be generated in the sequence.
@Scit3ch very thorough and excellent analysis! You seem to be right about this, and I was able to replicate it on a much simpler dataset by experimenting with sequence_length values approaching (and exceeding) the number of unique values in a single sequence of the original dataset.
In simpler terms, trying to synthesize a sequence of 7 rows where one of the original sequences only had 5 rows caused SDV to generate duplicates and essentially "run out" of dates to generate, because enforce_min_max_values was True.
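My reading of that failure mode, as a toy numeric sketch (an illustration of clipping behavior, not SDV internals): if the model keeps stepping the sequence index upward past the real-data max, clamping every value back to that max yields repeats at the tail.

```python
import numpy as np

# A sampled sequence of 7 dates, expressed as day offsets: the model
# wants 7 increasing steps.
sampled = np.array([0, 1, 2, 3, 4, 5, 6])

# But real sequences only span 5 days, so the observed max is day 4.
real_max = 4

# Clamping to the real-data max (what min/max enforcement would do)
# makes the tail "run out" of values: day 4 repeats.
clipped = np.minimum(sampled, real_max)
print(clipped)  # [0 1 2 3 4 4 4]

n_duplicates = len(clipped) - len(np.unique(clipped))
print(n_duplicates)  # 2
```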
Context columns don't seem to matter here. Thanks for all the help! I will close out this issue so we can track the work and discussion better in the bug report I've opened: https://github.com/sdv-dev/SDV/issues/2031
Some short-term workarounds, each with tradeoffs:

- Set enforce_min_max_values to False. This removes the max-value ceiling for the datetime sequence key column, but it means the synthesized data will be less representative of your real data, so this is a big tradeoff until the bug is fixed.
- Set sequence_length no higher than the number of rows in your smallest, least unique (with respect to the sequence key column) sequence from your real data. E.g. if you have a small sequence with 5 unique values for the sequence key, don't generate more than 5 rows per sequence. This is also a limitation of SDV until the bug is fixed.
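That second workaround can be derived from the real data itself. A minimal sketch, assuming a patient_id sequence key and visit_date sequence index as in this thread (toy values):

```python
import pandas as pd

# Toy real data: patient 1 has 5 unique visit dates, patient 2 has 3.
real_df = pd.DataFrame({
    'patient_id': [1, 1, 1, 1, 1, 2, 2, 2],
    'visit_date': ['2020-01-01', '2020-01-02', '2020-01-03',
                   '2020-01-04', '2020-01-05',
                   '2020-02-01', '2020-02-02', '2020-02-03'],
})

# Safe upper bound for sequence_length: the smallest count of unique
# sequence-index values across all real sequences.
safe_len = real_df.groupby('patient_id')['visit_date'].nunique().min()
print(safe_len)  # 3

# Then sample within that bound, e.g.:
# synthetic = synthesizer.sample(num_sequences=4000, sequence_length=safe_len)
```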
After applying the workaround in #1973, the sequence_index is no longer NaN, but I noticed a repeated sequence index (in my dataset it is the date of the visit). This date should be unique, because each value corresponds to a unique visit.
![Screenshot from 2024-05-14 14-42-07](https://github.com/sdv-dev/SDV/assets/17408097/a322d733-f370-48b3-b871-a16d9bc713eb)