sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

How to evaluate the quality of synthetic time series data generated from PARSynthesizer #2113

Open Pavamana15 opened 2 weeks ago

Pavamana15 commented 2 weeks ago

Problem description

@npatki, can you also explain how to evaluate the quality of synthetic data? I have generated synthetic time series data using PARSynthesizer, and now I want to test how good it is compared to the real data. The notebook at https://colab.research.google.com/drive/1YLk2uwn8yrSRPy0soEeJwu8Hdk_tGTlE?usp=sharing says, "The synthesizer is generating entirely new sequences in the same format as the real data. Each sequence represents an entirely new company based on the overall patterns from the dataset. They do not map or correspond to any real company." Given this statement, it is clear that I can't compare a synthetic sequence with a corresponding real sequence. However, I want to test the following things:

Diversity: The distribution of the synthetic samples should roughly match that of the real data. We can use dimensionality reduction (principal component analysis (PCA) and t-SNE) to visually inspect how closely the distribution of the synthetic samples resembles that of the original data. We can also compare correlations in the synthetic data against those in the original data.
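As a rough illustration of the PCA check described above, here is a minimal sketch using only NumPy. It assumes every sequence has the same length, so the data fits in an array of shape (n_sequences, seq_len, n_columns); the random arrays and the function name pca_project are placeholders, not part of SDV. PCA is fitted on the real data only, and both datasets are projected onto the same two components.

```python
import numpy as np

def pca_project(real, synthetic, n_components=2):
    """Project real and synthetic sequences onto the top principal
    components of the real data (hypothetical helper, not an SDV API)."""
    # Flatten each sequence into one row: (n_sequences, seq_len * n_columns)
    real_flat = np.asarray(real, dtype=float).reshape(len(real), -1)
    synth_flat = np.asarray(synthetic, dtype=float).reshape(len(synthetic), -1)

    # Center both datasets with the mean of the real data
    mean = real_flat.mean(axis=0)

    # Rows of vt are the principal directions of the centered real data
    _, _, vt = np.linalg.svd(real_flat - mean, full_matrices=False)
    components = vt[:n_components]

    real_2d = (real_flat - mean) @ components.T
    synth_2d = (synth_flat - mean) @ components.T
    return real_2d, synth_2d

# Placeholder data: 50 real and 40 synthetic sequences, 10 steps, 3 columns
rng = np.random.default_rng(0)
real = rng.normal(size=(50, 10, 3))
synthetic = rng.normal(size=(40, 10, 3))

real_2d, synth_2d = pca_project(real, synthetic)
print(real_2d.shape, synth_2d.shape)
```

Plotting real_2d and synth_2d as two scatter series on the same axes gives the visual comparison: heavily overlapping clouds suggest similar distributions, while synthetic points clustered in one region suggest low diversity. Sequences of varying length would first need padding, truncation, or per-sequence summary features.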

Fidelity: The synthetic series should be indistinguishable from the real data. For this, we can train a classifier to distinguish real from synthetic data. We can also check that the synthetic data is useful for the same predictive purposes as the real data (i.e. train-on-synthetic, test-on-real).
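One simple way to try the "train a classifier to distinguish real and synthetic data" idea is sketched below. It swaps in a leave-one-out 1-nearest-neighbour classifier, chosen only because it needs no training loop; it is not the classifier from any particular paper or from SDV. The function name and the random data are placeholders, and equal-length sequences are assumed. With equally sized real and synthetic sets, an accuracy near 0.5 means the classifier cannot tell them apart, while an accuracy near 1.0 means the synthetic data is easy to spot.

```python
import numpy as np

def loo_1nn_accuracy(real, synthetic):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    labels each sequence as real (1) or synthetic (0).
    Hypothetical helper, not an SDV API."""
    real_flat = np.asarray(real, dtype=float).reshape(len(real), -1)
    synth_flat = np.asarray(synthetic, dtype=float).reshape(len(synthetic), -1)

    X = np.vstack([real_flat, synth_flat])
    y = np.concatenate([np.ones(len(real_flat)), np.zeros(len(synth_flat))])

    # Pairwise squared Euclidean distances between all sequences
    sq_norms = (X ** 2).sum(axis=1)
    dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)

    # A sequence must not be matched with itself
    np.fill_diagonal(dist, np.inf)

    # Predict each sequence's label from its nearest other sequence
    nearest = dist.argmin(axis=1)
    return float((y[nearest] == y).mean())

# Placeholder data: 100 sequences each, 10 steps, 3 columns
rng = np.random.default_rng(0)
real = rng.normal(size=(100, 10, 3))
similar = rng.normal(size=(100, 10, 3))            # same distribution as real
shifted = rng.normal(loc=5.0, size=(100, 10, 3))   # clearly different

acc_similar = loo_1nn_accuracy(real, similar)
acc_shifted = loo_1nn_accuracy(real, shifted)
print(acc_similar, acc_shifted)
```

If the two sets differ in size, the chance level is no longer 0.5, so it is easier to subsample them to the same size before comparing. The train-on-synthetic, test-on-real check is a separate experiment: fit a predictive model on synthetic sequences and score it on held-out real ones.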

I want to evaluate the above two metrics, Diversity and Fidelity, on synthetic data. Do you know how I can do that?

Since each sequence corresponds to an entirely new company and does not map to any real company, I can't evaluate those two metrics against ground truth. So, are there any ways to check the above two metrics?

Pavamana15 commented 2 weeks ago

@npatki, can you answer the above question here?

npatki commented 2 weeks ago

Hi @Pavamana15, it's great to hear that you were able to create synthetic data. You're right that since this synthesizer creates brand new sequences, it's not possible to do any kind of 1-1 comparison with real vs. synthetic sequences.

Here are my recommendations:

Pavamana15 commented 2 weeks ago

Thank you, @npatki. I will try the above methods and get back to you if I have any questions.