Open ardulat opened 2 days ago
Hi @ardulat, without metadata this might be challenging to debug, but let's try!
In general, PARSynthesizer is one of our less mature synthesizers compared to our other single and multi table synthesizers. That alone could be causing this behavior, but it would be great to rule out a few other things first.
Hi @srinify, thank you for your quick response.
Here is what the metadata looks like (I removed the exact column names to preserve privacy):
{
"columns": {
"sequence_id": {
"sdtype": "id"
},
"context_column1": {
"sdtype": "categorical"
},
"context_column2": {
"sdtype": "categorical"
},
"context_column3": {
"sdtype": "categorical"
},
"context_column4": {
"sdtype": "categorical"
},
"context_column5": {
"sdtype": "categorical"
},
"context_column6": {
"sdtype": "numerical"
},
"context_column7": {
"sdtype": "categorical"
},
"time_series_column1": {
"sdtype": "numerical"
},
"time_series_column2": {
"sdtype": "numerical"
},
"time_series_column3": {
"sdtype": "categorical"
},
"time_series_column4": {
"sdtype": "categorical"
},
"time_series_column5": {
"sdtype": "categorical"
},
"time_series_column6": {
"sdtype": "numerical"
},
"time_series_column7": {
"sdtype": "numerical"
},
"time_series_column8": {
"sdtype": "numerical"
},
"time_series_column9": {
"sdtype": "numerical"
},
"time_series_column10": {
"sdtype": "numerical"
},
"time_series_column11": {
"sdtype": "numerical"
},
"time_series_column12": {
"sdtype": "numerical"
},
"time_series_column13__steps": {
"sdtype": "numerical"
},
"time_series_column14": {
"sdtype": "numerical"
},
"time_series_column15": {
"sdtype": "numerical"
},
"time_series_column16": {
"sdtype": "numerical"
},
"time_series_column17": {
"sdtype": "numerical"
},
"time_series_column18": {
"sdtype": "numerical"
},
"time_series_column19": {
"sdtype": "numerical"
},
"time_series_column20": {
"sdtype": "numerical"
},
"time_series_column21": {
"sdtype": "numerical"
},
"date": {
"sdtype": "datetime",
"datetime_format": "%Y-%m-%d"
},
"primary_key": {
"sdtype": "id"
}
},
"METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
"primary_key": "primary_key",
"sequence_index": "date",
"sequence_key": "sequence_id",
"synthesizer_info": {
"class_name": "PARSynthesizer",
"creation_date": "2024-09-18",
"is_fit": true,
"last_fit_date": "2024-09-18",
"fitted_sdv_version": "1.15.0"
}
}
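A quick way to sanity-check metadata like the above before fitting is to count the sdtypes and confirm that the sequence key, sequence index, and primary key point at real columns. The sketch below uses only the standard library and an abbreviated, hypothetical copy of the JSON (most columns omitted); `summarize_metadata` and the expected sdtypes per role are assumptions made for illustration, not SDV's own validation.

```python
import json
from collections import Counter

# Abbreviated, hypothetical copy of the metadata above (most columns omitted).
metadata_json = """
{
  "columns": {
    "sequence_id": {"sdtype": "id"},
    "context_column6": {"sdtype": "numerical"},
    "time_series_column1": {"sdtype": "numerical"},
    "time_series_column3": {"sdtype": "categorical"},
    "date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
    "primary_key": {"sdtype": "id"}
  },
  "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
  "primary_key": "primary_key",
  "sequence_index": "date",
  "sequence_key": "sequence_id"
}
"""

# Assumed sdtypes for each role; adjust if your SDV version expects otherwise.
EXPECTED_SDTYPES = {
    "sequence_key": {"id"},
    "sequence_index": {"datetime", "numerical"},
    "primary_key": {"id"},
}


def summarize_metadata(metadata: dict) -> dict:
    """Count sdtypes and list role columns that are missing or oddly typed."""
    columns = metadata["columns"]
    sdtype_counts = Counter(spec["sdtype"] for spec in columns.values())
    problems = []
    for role, allowed in EXPECTED_SDTYPES.items():
        name = metadata.get(role)
        if name is None:
            problems.append(f"{role} is not set")
        elif name not in columns:
            problems.append(f"{role} points at unknown column {name!r}")
        elif columns[name]["sdtype"] not in allowed:
            problems.append(f"{role} column {name!r} has sdtype {columns[name]['sdtype']!r}")
    return {"sdtype_counts": dict(sdtype_counts), "problems": problems}


summary = summarize_metadata(json.loads(metadata_json))
print(summary["sdtype_counts"])
print(summary["problems"])
```

An empty `problems` list only means the roles are wired up consistently; it says nothing about whether each sdtype is the right choice for the underlying data.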
A few issues with this metadata:
context_column6 corresponds to dates converted to timestamps, following https://github.com/sdv-dev/SDV/issues/2115. But this leads to unrealistic dates, e.g., in the years 1617 and 2253.
Answering your questions:
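On the context_column6 point above: once a date is stored as a plain number, nothing constrains sampled values to the observed range, so a value far outside it maps back to a year centuries away. The sketch below illustrates that round trip with the standard library only. It assumes the conversion is seconds since 1970-01-01 UTC, which the thread does not state, and the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: timestamps are seconds since the Unix epoch, in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_timestamp(date_string: str, fmt: str = "%Y-%m-%d") -> float:
    """Convert a date string to seconds since the epoch."""
    parsed = datetime.strptime(date_string, fmt).replace(tzinfo=timezone.utc)
    return (parsed - EPOCH).total_seconds()


def timestamp_to_date(timestamp: float) -> str:
    """Convert seconds since the epoch back to an ISO date string."""
    # Adding a timedelta also works for negative (pre-1970) timestamps.
    return (EPOCH + timedelta(seconds=timestamp)).date().isoformat()


def flag_out_of_range(sampled, observed):
    """Return the sampled timestamps that fall outside the observed range."""
    low, high = min(observed), max(observed)
    return [value for value in sampled if value < low or value > high]


# Hypothetical real dates and hypothetical sampled timestamps.
observed = [date_to_timestamp(d) for d in ("2019-02-01", "2021-06-15", "2023-11-30")]
sampled = [
    date_to_timestamp("2020-08-09"),  # plausible
    date_to_timestamp("1617-01-01"),  # far below the observed range
    date_to_timestamp("2253-01-01"),  # far above the observed range
]

for value in flag_out_of_range(sampled, observed):
    print(value, "->", timestamp_to_date(value))
```

Flagging values this way only shows how far the sampled numbers drift from the training range; it does not explain why the model produces them.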
Environment details
If you are already running SDV, please indicate the following details about the environment in which you are running it:
Problem description
I've been using SDV for quite a while now. However, recently, after analyzing the sampled data, I observed a weird behavior in sampling time series data. The issue is that the sequential model PARSynthesizer keeps generating uniform distributions for time series data in almost all my columns. I am attaching two plots, which clearly show the difference.
Actual data distribution plot:
Synthetic data distribution plot:
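Since the plots are not reproduced here, one way to put a number on "looks uniform" is to bin a column and measure how far the bin proportions are from a flat histogram. The sketch below is a minimal, standard-library illustration on made-up data; `distance_from_uniform` is a hypothetical helper (a total variation distance against equal-width bins), not an SDV metric.

```python
import random


def bin_proportions(values, low, high, bins=10):
    """Share of values in each of `bins` equal-width bins over [low, high]."""
    if not values or high <= low:
        raise ValueError("need at least one value and high > low")
    width = (high - low) / bins
    counts = [0] * bins
    for value in values:
        # Clamp so values on or beyond the edges land in the outer bins.
        index = min(max(int((value - low) / width), 0), bins - 1)
        counts[index] += 1
    return [count / len(values) for count in counts]


def distance_from_uniform(values, low, high, bins=10):
    """Total variation distance between the binned values and a flat histogram.

    Near 0 means the column looks uniform; values closer to 1 mean the
    mass is concentrated in a few bins.
    """
    proportions = bin_proportions(values, low, high, bins)
    return 0.5 * sum(abs(p - 1 / bins) for p in proportions)


# Made-up stand-ins for one real column and its synthetic counterpart.
rng = random.Random(0)
real_like = [rng.gauss(50, 5) for _ in range(5000)]
synthetic_like = [rng.uniform(0, 100) for _ in range(5000)]

print("real:", distance_from_uniform(real_like, 0, 100))
print("synthetic:", distance_from_uniform(synthetic_like, 0, 100))
```

Running this per column on the real and sampled tables, with the same `low` and `high` for both, gives a rough ranking of which columns collapse toward a flat distribution.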
What I already tried
I tried synthesizing on different datasets and with different numbers of epochs. The code snippet related to the model fitting:
I can't share the data or anything related to that (including metadata) since it is sensitive medical data.