PARSynthesizer samples uniformly distributed time series data

ardulat commented 2 days ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

SDV version: 1.15.9
Python version: 3.12
Operating System: linux/amd64 (Docker image)

Problem description

I've been using SDV for quite a while now. Recently, while analyzing sampled data, I noticed odd behavior when sampling time series: the sequential model PARSynthesizer generates uniformly distributed values for almost all of my time series columns. I am attaching two plots that clearly show the difference.

Actual data distribution plot:

Synthetic data distribution plot:
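
A rough way to put a number on the difference between the two plots is to measure how far each column is from a uniform distribution over its own min/max range. The following is only a sketch in plain Python, not part of SDV: the function name `ks_distance_from_uniform` is made up for illustration, and the two samples are randomly generated stand-ins, not the real data.

```python
import random


def ks_distance_from_uniform(values):
    """Kolmogorov-Smirnov distance between a sample and the uniform
    distribution on [min(values), max(values)]. Values near 0 mean the
    sample looks uniform; larger values mean it does not."""
    xs = sorted(values)
    n = len(xs)
    low, high = xs[0], xs[-1]
    if high == low:
        raise ValueError("column is constant, the distance is undefined")
    distance = 0.0
    for i, x in enumerate(xs, start=1):
        u = (x - low) / (high - low)  # uniform CDF evaluated at x
        # Compare with the empirical CDF just after and just before x.
        distance = max(distance, i / n - u, u - (i - 1) / n)
    return distance


# Made-up stand-ins for one column of synthetic and real data.
random.seed(0)
uniform_like = [random.uniform(0, 10) for _ in range(2000)]
skewed = [random.expovariate(1.0) for _ in range(2000)]

print("uniform-like column:", round(ks_distance_from_uniform(uniform_like), 3))
print("skewed column:      ", round(ks_distance_from_uniform(skewed), 3))
```

Running the same function on each real column and its synthetic counterpart (for example on `df[column].dropna().tolist()`) gives a per-column number that can be shared even when the data itself cannot.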

What I already tried

I tried synthesizing on different datasets and with different numbers of epochs. The code snippet related to the model fitting:

# Initialize synthesizer
self.synthesizer = self.model_class(
    self.metadata,
    epochs=self.epochs,
    cuda=self.cuda,
    context_columns=list(self.context_columns),
    verbose=True,
    # Control whether the synthetic data should adhere to the same min/max
    # boundaries set by the real data
    enforce_min_max_values=True,
    # Control whether the synthetic data should have the same number of decimal
    # digits as the real data
    enforce_rounding=False,
)

# Fit synthesizer
self.synthesizer.fit(data)

I can't share the data or anything derived from it (including the metadata) because it is sensitive medical data.

srinify commented 2 days ago

Hi @ardulat, without the metadata this might be challenging to debug, but let's try!

Are the synthetic distributions uniform for context columns and non-context columns?
How many columns fall into each bucket (context vs non-context)?

In general, PARSynthesizer is one of our less mature synthesizers compared to our single-table and multi-table synthesizers. That alone could be causing this behavior, but it would be great to rule out a few other things first.

ardulat commented 1 day ago

Hi @srinify, thank you for your quick response.

Here is what the metadata looks like (I removed the exact column names to preserve privacy):

{
  "columns": {
    "sequence_id": {
      "sdtype": "id"
    },
    "context_column1": {
      "sdtype": "categorical"
    },
    "context_column2": {
      "sdtype": "categorical"
    },
    "context_column3": {
      "sdtype": "categorical"
    },
    "context_column4": {
      "sdtype": "categorical"
    },
    "context_column5": {
      "sdtype": "categorical"
    },
    "context_column6": {
      "sdtype": "numerical"
    },
    "context_column7": {
      "sdtype": "categorical"
    },
    "time_series_column1": {
      "sdtype": "numerical"
    },
    "time_series_column2": {
      "sdtype": "numerical"
    },
    "time_series_column3": {
      "sdtype": "categorical"
    },
    "time_series_column4": {
      "sdtype": "categorical"
    },
    "time_series_column5": {
      "sdtype": "categorical"
    },
    "time_series_column6": {
      "sdtype": "numerical"
    },
    "time_series_column7": {
      "sdtype": "numerical"
    },
    "time_series_column8": {
      "sdtype": "numerical"
    },
    "time_series_column9": {
      "sdtype": "numerical"
    },
    "time_series_column10": {
      "sdtype": "numerical"
    },
    "time_series_column11": {
      "sdtype": "numerical"
    },
    "time_series_column12": {
      "sdtype": "numerical"
    },
    "time_series_column13__steps": {
      "sdtype": "numerical"
    },
    "time_series_column14": {
      "sdtype": "numerical"
    },
    "time_series_column15": {
      "sdtype": "numerical"
    },
    "time_series_column16": {
      "sdtype": "numerical"
    },
    "time_series_column17": {
      "sdtype": "numerical"
    },
    "time_series_column18": {
      "sdtype": "numerical"
    },
    "time_series_column19": {
      "sdtype": "numerical"
    },
    "time_series_column20": {
      "sdtype": "numerical"
    },
    "time_series_column21": {
      "sdtype": "numerical"
    },
    "date": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },
    "primary_key": {
      "sdtype": "id"
    }
  },
  "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
  "primary_key": "primary_key",
  "sequence_index": "date",
  "sequence_key": "sequence_id",
  "synthesizer_info": {
    "class_name": "PARSynthesizer",
    "creation_date": "2024-09-18",
    "is_fit": true,
    "last_fit_date": "2024-09-18",
    "fitted_sdv_version": "1.15.0"
  }
}

A few issues I see with this setup:

context_column6 holds dates converted to timestamps, following https://github.com/sdv-dev/SDV/issues/2115. However, the sampled values convert back to implausible dates, e.g., in the years 1617 and 2253.
The categorical time series columns (time_series_column3 through time_series_column5) come back as floats. The same happens with numerical columns that hold integers.
On average, 36% of the sampled dates are null.
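
The three observations above can be checked with a few small helpers. This is a sketch in plain Python under stated assumptions: the timestamps are Unix seconds (the thread does not say which unit was used), missing values are represented as `None`, and all function names and sample values are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def timestamp_to_date(seconds):
    """Convert Unix seconds to a UTC calendar date. Adding a timedelta to
    the epoch also works for negative (pre-1970) values."""
    return (EPOCH + timedelta(seconds=seconds)).date()


def share_outside_range(sampled, real):
    """Fraction of sampled values below min(real) or above max(real)."""
    low, high = min(real), max(real)
    return sum(1 for v in sampled if not low <= v <= high) / len(sampled)


def share_non_integer(values):
    """Fraction of non-null values that have a fractional part."""
    present = [v for v in values if v is not None]
    return sum(1 for v in present if float(v) != int(v)) / len(present)


def share_null(values):
    """Fraction of values that are None."""
    return sum(1 for v in values if v is None) / len(values)


# Made-up values: real timestamps span 2020, one sampled value falls inside.
real_ts = [1_577_836_800, 1_609_459_200]
sampled_ts = [1_590_000_000, -11_000_000_000, 9_000_000_000]

print([timestamp_to_date(ts).isoformat() for ts in sampled_ts])
print("outside real range:", share_outside_range(sampled_ts, real_ts))
print("non-integer share: ", share_non_integer([1, 2.5, None, 3.0]))
print("null share:        ", share_null([None, 1, 2, None]))
```

With pandas data, nulls usually appear as NaN or NaT rather than `None`, so the columns would need to be converted first (or the checks rewritten with `isna()`).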

Answering your questions:

No. Among the context columns, only one is numerical, and its distribution is not uniform. Among the non-context columns, all or almost all have uniform distributions.
There are 7 context columns and 21 non-context columns (not counting the id and date columns).
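
The counts above can be reproduced from the metadata JSON alone. The sketch below uses only the standard library and a shortened excerpt of the metadata posted earlier; `bucket_columns` is a hypothetical helper, not an SDV function, and the context column list is assumed to be known separately because the metadata does not store it.

```python
import json
from collections import Counter


def bucket_columns(metadata, context_columns, skip_sdtypes=("id", "datetime")):
    """Count columns per sdtype, split into context and non-context buckets.
    Columns whose sdtype is in skip_sdtypes (ids, dates) are left out."""
    context = set(context_columns)
    buckets = {"context": Counter(), "non_context": Counter()}
    for name, spec in metadata["columns"].items():
        sdtype = spec["sdtype"]
        if sdtype in skip_sdtypes:
            continue
        bucket = "context" if name in context else "non_context"
        buckets[bucket][sdtype] += 1
    return buckets


# Shortened excerpt of the metadata from this thread.
metadata = json.loads("""
{
  "columns": {
    "sequence_id": {"sdtype": "id"},
    "context_column1": {"sdtype": "categorical"},
    "context_column6": {"sdtype": "numerical"},
    "time_series_column1": {"sdtype": "numerical"},
    "time_series_column3": {"sdtype": "categorical"},
    "date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
    "primary_key": {"sdtype": "id"}
  }
}
""")

buckets = bucket_columns(metadata, ["context_column1", "context_column6"])
for bucket, counts in buckets.items():
    print(bucket, dict(counts), "total:", sum(counts.values()))
```

Loading the full metadata file with `json.load` instead of the excerpt should give the totals reported above.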
