sdv-dev / SDGym

Benchmarking synthetic data generation methods.

Benchmark with F1 score (reproduce leaderboard) #133

Closed mnwright closed 1 year ago

mnwright commented 2 years ago

I'm trying to reproduce your CTGAN NeurIPS paper and/or the leaderboard linked here on GitHub, and I wonder how to benchmark with the F1 score.

I'm following the code in README.md, i.e.:

```python
import numpy as np
import pandas as pd
from sdv.tabular import GaussianCopula

def gaussian_copula(real_data, metadata):
    gc = GaussianCopula(default_distribution='gaussian')
    table_name = metadata.get_tables()[0]
    gc.fit(real_data[table_name])
    return {table_name: gc.sample()}

import sdgym

scores = sdgym.run(synthesizers=gaussian_copula, datasets=['adult'])
```

But the result seems to be accuracy, not F1. Where can I find the code to reproduce the leaderboard linked in the repo?

Thanks!

npatki commented 1 year ago

Hi @mnwright, are you still looking into this project? I realize it's been over a year since the issue was filed.

Since then, we've significantly updated the SDGym usage/API and streamlined its functionality. The benchmarking script now reports several measurements such as performance, memory and an overall quality score. (For more information, see the new SDGym docs here.)

It seems like the evaluation you are describing (with accuracy or F1 scores) requires a more complex setup: identifying a target column and training an ML model on it. SDGym is not able to accommodate this.
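For illustration only, here is a minimal sketch of what such a custom setup could look like: train a classifier on synthetic rows, predict a chosen target column on held-out real rows, and report both accuracy and F1. Everything in it is an assumption rather than SDGym or SDV code: the helper names (`f1_score`, `predict_1nn`, `evaluate_synthetic`) are hypothetical, the rows are made-up toy data, and the 1-nearest-neighbour classifier is just a dependency-free stand-in for whatever models the paper used.

```python
from typing import Dict, List, Sequence, Tuple

# One row: (numeric features, binary label from the chosen target column).
Row = Tuple[Sequence[float], int]


def f1_score(y_true: List[int], y_pred: List[int], positive: int = 1) -> float:
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def predict_1nn(train: List[Row], features: Sequence[float]) -> int:
    """Return the label of the closest training row (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
    )
    return nearest[1]


def evaluate_synthetic(synthetic_train: List[Row], real_test: List[Row]) -> Dict[str, float]:
    """Fit on synthetic rows, score on real rows."""
    y_true = [label for _, label in real_test]
    y_pred = [predict_1nn(synthetic_train, feats) for feats, _ in real_test]
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {"accuracy": accuracy, "f1": f1_score(y_true, y_pred)}


# Toy data (hypothetical): in practice synthetic_train would come from a
# synthesizer's sample() output and real_test from a held-out real split.
synthetic_train = [
    ((25.0, 20.0), 0),
    ((30.0, 35.0), 0),
    ((45.0, 50.0), 1),
    ((50.0, 60.0), 1),
]
real_test = [
    ((28.0, 30.0), 0),
    ((48.0, 55.0), 1),
    ((40.0, 45.0), 0),
    ((27.0, 22.0), 0),
    ((32.0, 30.0), 0),
]

scores = evaluate_synthetic(synthetic_train, real_test)
print(scores)
```

On this toy data the accuracy is 0.8 while the F1 score is about 0.67, because the single false positive halves the precision on the rare positive class. That gap is why the two metrics are not interchangeable when comparing against a leaderboard.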

If you are still looking to reproduce results from CTGAN, I'd recommend you ask a question in the SDV library instead. We may be able to guide you through a more custom setup that is similar to the paper.