sdv-dev / SDGym

Benchmarking synthetic data generation methods.

Reproducibility #120

Closed BiggyBing closed 1 year ago

BiggyBing commented 3 years ago

Description

I am trying to reproduce the results reported in the CTGAN paper, but I cannot fully reproduce them:

  1. For CTGAN, I can rarely reproduce the same results by running the code below.
  2. For the credit dataset, the dataset in the sdgym package does not seem to be the same as the one reported in the paper.

What I Did

To reproduce the results, I followed the demo:

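import sdgym
# CTGAN is taken from sdgym.synthesizers; an earlier import of the same name
# from sdv.tabular was shadowed by this one and has been dropped.
from sdgym.synthesizers import (
    CLBN, CopulaGAN, CTGAN, Identity, Independent,
    MedGAN)

# Benchmark CTGAN on the 'asia' dataset.
asia_scores = sdgym.run(synthesizers=CTGAN, datasets=['asia'])
# Benchmark the Identity baseline on the 'credit' dataset.
credit_scores = sdgym.run(synthesizers=Identity, datasets=['credit'])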

npatki commented 1 year ago

Hi @BiggyBing, I'm not sure if you are still having problems with this, as it's been some time since the issue was filed.

Since this issue was filed, we have significantly updated the usage/API and streamlined the SDGym library's functionality. The benchmarking scripts will now report several metrics such as modeling time, memory usage and synthetic data quality.

For more resources see the new SDGym documentation.

Unfortunately, the benchmarking script cannot be used as-is to reproduce the experiment in the CTGAN paper, because that experiment has a more advanced setup with an ML integration and a holdout group. If you are still looking to reproduce the CTGAN results, I'd recommend asking the question in the SDV library.
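
For readers unfamiliar with that kind of setup, here is a minimal sketch of what an evaluation with an ML step and a holdout group can look like. It is not SDGym's API and not the exact protocol from the CTGAN paper. Every name in it (`split_holdout`, `bootstrap_synthesizer`, `fit_centroids`, `ml_efficacy`, `make_toy_data`) is hypothetical, the "synthesizer" is a simple resampling stand-in rather than CTGAN, and the classifier is a toy nearest-centroid model. The idea it illustrates: fit the synthesizer on the training split only, train a classifier on the synthetic rows, and score that classifier on real holdout rows the synthesizer never saw. Seeds are fixed so repeated runs give the same score.

```python
import random
from statistics import mean


def split_holdout(rows, holdout_fraction=0.25, seed=0):
    """Shuffle the real rows and split them into (train, holdout)."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]


def bootstrap_synthesizer(train_rows, n_samples, seed=0):
    """Hypothetical stand-in for a real synthesizer: resamples training rows."""
    rng = random.Random(seed)
    return [rng.choice(train_rows) for _ in range(n_samples)]


def fit_centroids(rows):
    """Toy nearest-centroid classifier: one mean feature vector per label."""
    by_label = {}
    for features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*feats)]
        for label, feats in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def ml_efficacy(real_rows, synthesizer, seed=0):
    """Train on synthetic data, then measure accuracy on the real holdout."""
    train, holdout = split_holdout(real_rows, seed=seed)
    synthetic = synthesizer(train, len(train), seed=seed)
    centroids = fit_centroids(synthetic)
    correct = sum(predict(centroids, f) == y for f, y in holdout)
    return correct / len(holdout)


def make_toy_data(n=200, seed=0):
    """Two well-separated Gaussian blobs, labelled 0 and 1 (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 3.0 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    return rows


data = make_toy_data()
score = ml_efficacy(data, bootstrap_synthesizer, seed=42)
print(f"accuracy on real holdout: {score:.3f}")
```

To try this with a real model, the stand-in would be swapped for a function that fits the synthesizer on `train_rows` and samples `n_samples` rows from it; the holdout split and the scoring step stay the same.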