sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

PARSynthesizer samples uniformly distributed time series data #2241

Open ardulat opened 2 days ago

ardulat commented 2 days ago

Environment details

If you are already running SDV, please indicate the following details about the environment in which you are running it:

Problem description

I've been using SDV for quite a while. Recently, after analyzing the sampled data, I noticed odd behavior when sampling time series data: the sequential model PARSynthesizer generates uniformly distributed values for almost all of my time series columns. I am attaching two plots that show the difference.

Actual data distribution plot: [image]

Synthetic data distribution plot: [image: Screenshot 2024-09-23 at 2 32 09 PM]
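Without the plots, one way to put a number on "looks uniform" is the Kolmogorov-Smirnov distance between a column's values and a uniform distribution over that column's own min/max range. The sketch below is only an illustration under that assumption: the function name `ks_distance_to_uniform` and the randomly generated sample data are hypothetical and are not taken from the reporter's dataset or from SDV.

```python
import random


def ks_distance_to_uniform(values):
    """Kolmogorov-Smirnov distance between a sample and Uniform(min, max).

    Values near 0 mean the sample is close to uniform over its own range;
    larger values mean the sample has more structure (peaks, tails).
    """
    xs = sorted(values)
    n = len(xs)
    if n == 0:
        raise ValueError("need at least one value")
    lo, hi = xs[0], xs[-1]
    if hi == lo:
        raise ValueError("all values are identical")
    distance = 0.0
    for i, x in enumerate(xs):
        u = (x - lo) / (hi - lo)  # uniform CDF evaluated at x
        # Compare against the empirical CDF just before and just after x.
        distance = max(distance, abs((i + 1) / n - u), abs(u - i / n))
    return distance


# Hypothetical stand-ins for a synthetic column and a real column.
rng = random.Random(0)
uniform_like = [rng.uniform(0, 100) for _ in range(5000)]
bell_like = [rng.gauss(50, 5) for _ in range(5000)]

d_uniform = ks_distance_to_uniform(uniform_like)
d_bell = ks_distance_to_uniform(bell_like)
print(f"uniform-like column: {d_uniform:.3f}")
print(f"bell-shaped column:  {d_bell:.3f}")
```

Running the same function on each real column and its synthetic counterpart would show which columns collapse toward a uniform shape, without having to share the underlying values.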

What I already tried

I tried synthesizing on different datasets and with different numbers of epochs. The code snippet related to the model fitting:

# Initialize synthesizer
self.synthesizer = self.model_class(
    self.metadata,
    epochs=self.epochs,
    cuda=self.cuda,
    context_columns=list(self.context_columns),
    verbose=True,
    # Control whether the synthetic data should adhere to the same min/max
    # boundaries set by the real data
    enforce_min_max_values=True,
    # Control whether the synthetic data should have the same number of decimal
    # digits as the real data
    enforce_rounding=False,
)

# Fit synthesizer
self.synthesizer.fit(data)

I can't share the data or anything related to that (including metadata) since it is sensitive medical data.

srinify commented 2 days ago

Hi @ardulat, without metadata this might be challenging to debug, but let's try!

In general, PARSynthesizer is one of our less mature synthesizers compared to our other single-table and multi-table synthesizers. That alone could be causing this behavior, but it would be great to rule out a few other things first.

ardulat commented 1 day ago

Hi @srinify, thank you for your quick response.

Here is what the metadata looks like (I removed the exact column names to preserve privacy):

{
  "columns": {
    "sequence_id": {
      "sdtype": "id"
    },
    "context_column1": {
      "sdtype": "categorical"
    },
    "context_column2": {
      "sdtype": "categorical"
    },
    "context_column3": {
      "sdtype": "categorical"
    },
    "context_column4": {
      "sdtype": "categorical"
    },
    "context_column5": {
      "sdtype": "categorical"
    },
    "context_column6": {
      "sdtype": "numerical"
    },
    "context_column7": {
      "sdtype": "categorical"
    },
    "time_series_column1": {
      "sdtype": "numerical"
    },
    "time_series_column2": {
      "sdtype": "numerical"
    },
    "time_series_column3": {
      "sdtype": "categorical"
    },
    "time_series_column4": {
      "sdtype": "categorical"
    },
    "time_series_column5": {
      "sdtype": "categorical"
    },
    "time_series_column6": {
      "sdtype": "numerical"
    },
    "time_series_column7": {
      "sdtype": "numerical"
    },
    "time_series_column8": {
      "sdtype": "numerical"
    },
    "time_series_column9": {
      "sdtype": "numerical"
    },
    "time_series_column10": {
      "sdtype": "numerical"
    },
    "time_series_column11": {
      "sdtype": "numerical"
    },
    "time_series_column12": {
      "sdtype": "numerical"
    },
    "time_series_column13__steps": {
      "sdtype": "numerical"
    },
    "time_series_column14": {
      "sdtype": "numerical"
    },
    "time_series_column15": {
      "sdtype": "numerical"
    },
    "time_series_column16": {
      "sdtype": "numerical"
    },
    "time_series_column17": {
      "sdtype": "numerical"
    },
    "time_series_column18": {
      "sdtype": "numerical"
    },
    "time_series_column19": {
      "sdtype": "numerical"
    },
    "time_series_column20": {
      "sdtype": "numerical"
    },
    "time_series_column21": {
      "sdtype": "numerical"
    },
    "date": {
      "sdtype": "datetime",
      "datetime_format": "%Y-%m-%d"
    },
    "primary_key": {
      "sdtype": "id"
    }
  },
  "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
  "primary_key": "primary_key",
  "sequence_index": "date",
  "sequence_key": "sequence_id",
  "synthesizer_info": {
    "class_name": "PARSynthesizer",
    "creation_date": "2024-09-18",
    "is_fit": true,
    "last_fit_date": "2024-09-18",
    "fitted_sdv_version": "1.15.0"
  }
}
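A metadata dictionary of this shape can be summarized with the standard library alone, which makes it easier to spot a column whose sdtype is not what was intended. The sketch below uses a trimmed, hypothetical subset of the columns above. The helper `check_sequence_fields` and the rules it encodes (the sequence key is an `id` column, the sequence index is `datetime` or `numerical`) are assumptions made for illustration, not SDV's own validation.

```python
import json
from collections import Counter

# Trimmed, hypothetical subset of the metadata shown above.
metadata_json = """
{
  "columns": {
    "sequence_id": {"sdtype": "id"},
    "context_column1": {"sdtype": "categorical"},
    "context_column6": {"sdtype": "numerical"},
    "time_series_column1": {"sdtype": "numerical"},
    "time_series_column3": {"sdtype": "categorical"},
    "date": {"sdtype": "datetime", "datetime_format": "%Y-%m-%d"},
    "primary_key": {"sdtype": "id"}
  },
  "primary_key": "primary_key",
  "sequence_index": "date",
  "sequence_key": "sequence_id"
}
"""

metadata = json.loads(metadata_json)
columns = metadata["columns"]

# How many columns were declared with each sdtype.
sdtype_counts = Counter(spec["sdtype"] for spec in columns.values())


def check_sequence_fields(metadata):
    """Return a list of problems with the sequence key and sequence index."""
    problems = []
    columns = metadata["columns"]
    expected = {
        "sequence_key": {"id"},
        "sequence_index": {"datetime", "numerical"},
    }
    for field, allowed in expected.items():
        name = metadata.get(field)
        if name not in columns:
            problems.append(f"{field} {name!r} is not a declared column")
        elif columns[name]["sdtype"] not in allowed:
            problems.append(f"{field} {name!r} has sdtype {columns[name]['sdtype']!r}")
    return problems


problems = check_sequence_fields(metadata)
print(dict(sdtype_counts))
print(problems or "sequence fields look consistent")
```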

A few issues with this metadata:

  1. context_column6 holds dates converted to timestamps, following https://github.com/sdv-dev/SDV/issues/2115. But this produces implausible sampled dates, e.g., in the years 1617 and 2253.
  2. Categorical time series columns (time_series_column3 through time_series_column5) are sampled as floats. The same applies to numerical columns of integer type.
  3. On average, 36% of the sampled dates are null.
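The three symptoms in the list above can each be measured or masked in post-processing. The following is a minimal sketch under stated assumptions: dates are stored as seconds since 1970-01-01, categories are numeric codes, and missing values appear as `None`. All helper names and sample values are hypothetical. Clipping and snapping hide the symptoms only; they do not address why the model produces such values.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def to_timestamp(date_str, fmt="%Y-%m-%d"):
    """Convert a date string to seconds since 1970-01-01 (may be negative)."""
    return (datetime.strptime(date_str, fmt) - EPOCH).total_seconds()


def from_timestamp(ts, lo, hi, fmt="%Y-%m-%d"):
    """Convert seconds back to a date string, clipped to the range [lo, hi]."""
    clipped = min(max(ts, lo), hi)
    return (EPOCH + timedelta(seconds=clipped)).strftime(fmt)


def null_fraction(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)


def snap_to_category(value, categories):
    """Map a sampled float to the nearest known numeric category code."""
    return min(categories, key=lambda c: abs(c - value))


# Hypothetical observed date range in the real data.
lo = to_timestamp("2019-03-01")
hi = to_timestamp("2021-11-15")

# Sampled timestamps far outside the observed range are pulled back to its edges.
too_late = from_timestamp(to_timestamp("2253-01-01"), lo, hi)
too_early = from_timestamp(to_timestamp("1617-06-01"), lo, hi)

# Share of missing dates in a hypothetical sampled column.
missing = null_fraction(["2020-01-01", None, None, "2020-01-03"])

# A float sampled for a column whose real categories are the codes 0, 1, 2.
snapped = snap_to_category(1.7, [0, 1, 2])

print(too_late, too_early, missing, snapped)
```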

Answering your questions: