sdv-dev / SDV

Synthetic data generation for tabular data
https://docs.sdv.dev/sdv

Gaussian KDE slower #1103

Open R-Palazzo opened 1 year ago

R-Palazzo commented 1 year ago

Environment details

Problem description

I'm looking to sample a 1D distribution using the gaussian_kde option of the field_distributions parameter of GaussianCopula(). real_data is a pd.DataFrame with a single column named 'Data'.

When I run

gc_synthetizer = GaussianCopula(field_distributions={'Data':'gaussian_kde'})
gc_synthetizer.fit(np.round(real_data,14))
synthetic_data = gc_synthetizer.sample(len(real_data))

It works, but it is far slower than GaussianCopula() with default parameters. I tried different numbers of samples for real_data, and it takes 50 to 200 times longer with gaussian_kde. I also tried SciPy's gaussian_kde(), and it is much faster to fit and sample from: it takes roughly the same time as, or a bit longer than, GaussianCopula() with default parameters.
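For reference, the SciPy baseline mentioned above can be timed with a minimal sketch like the one below. The data here is a hypothetical stand-in (random normal values) for the 'Data' column, and the sample size is an arbitrary assumption; the sketch covers only SciPy's own fit and resample and does not touch SDV.

```python
import time

import numpy as np
from scipy.stats import gaussian_kde

# Hypothetical stand-in for real_data['Data']; the real column is not shown in the issue.
rng = np.random.default_rng(0)
data = rng.normal(size=1000)

start = time.perf_counter()
kde = gaussian_kde(data)                       # "fit": stores the data and picks a bandwidth
samples = kde.resample(len(data), seed=0)[0]   # resample returns shape (1, n) for 1D data
elapsed = time.perf_counter() - start

print(f"SciPy gaussian_kde fit + sample: {elapsed:.4f} s")
```

Timing the GaussianCopula() runs the same way, with time.perf_counter() around fit and sample, would give numbers that can be compared directly.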

npatki commented 1 year ago

Marking this as a feature request, as we can track this as a performance improvement.

Part of the issue is that for the purposes of Gaussian Copula computation, it is not enough just to fit a KDE. We also have to convert the distribution using a CDF (and back). We should probably profile all steps of this.
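To illustrate why the CDF step can dominate, here is a rough sketch using only SciPy. It is an assumption about where the cost could come from, not a description of SDV's internals. A KDE has no closed-form inverse CDF, so converting back means solving for it numerically, and every CDF evaluation touches all training points. The helper names kde_cdf and kde_ppf are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import gaussian_kde

# Hypothetical 1D data standing in for the real column.
rng = np.random.default_rng(0)
data = rng.normal(size=500)
kde = gaussian_kde(data)


def kde_cdf(x):
    """CDF of the KDE at x: integrates the density from -inf to x (cost grows with len(data))."""
    return kde.integrate_box_1d(-np.inf, x)


def kde_ppf(u, lower, upper):
    """Inverse CDF by root finding: each call evaluates kde_cdf many times."""
    return brentq(lambda x: kde_cdf(x) - u, lower, upper)


# Bracket wide enough that the CDF is ~0 at lower and ~1 at upper.
lower = data.min() - 10 * data.std()
upper = data.max() + 10 * data.std()

u = kde_cdf(0.3)                   # data space -> uniform space
x_back = kde_ppf(u, lower, upper)  # uniform space -> data space
print(u, x_back)
```

Fitting and resampling need neither of these functions, which would be consistent with plain SciPy being fast while the copula workflow is slow. Profiling each step separately would confirm or rule this out.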