sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

Provide baseline measurement for ML efficacy #161

Open marianokamp opened 3 years ago

marianokamp commented 3 years ago

Problem Description

Right now, for ML efficacy the metrics are computed on real data vs. synthetic data. While this information is perfect for gauging whether a good-enough model can be fit, it would also be interesting to learn how much performance we lose because of the synthesization.

Expected behavior

Not sure how best to integrate it with the other metrics. Maybe as additional return values? To stay backwards compatible, they could be returned conditioned on the caller passing the arg compute_relative_performance=True to compute:

synth_f1 = BinaryDecisionTreeClassifier.compute(data, new_data, target='placed')

vs

synth_f1, real_f1, rel_perf = \
    BinaryDecisionTreeClassifier.compute(data, new_data, target='placed', 
                                         compute_relative_performance=True)

Anyhow, the result I'd like to see is something like this:

from sdv.demo import load_tabular_demo
from sdv.metrics.tabular import BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier
from sdv.tabular import CopulaGAN

data = load_tabular_demo('student_placements')

model = CopulaGAN()
model.fit(data)

new_data = model.sample(200)

for clf in [BinaryDecisionTreeClassifier, BinaryAdaBoostClassifier, BinaryMLPClassifier]:
    # Reference: train and test the classifier on the real data itself.
    r_f1 = clf.compute(data, data, target='placed')
    # Efficacy: train on the synthetic data, test on the real data.
    s_f1 = clf.compute(data, new_data, target='placed')
    print(f'{clf.__name__:30s} real f1: {r_f1:5.4f} synth f1: {s_f1:5.4f} performance: {s_f1/r_f1:5.2f}')

Resulting in:

BinaryDecisionTreeClassifier   real f1: 1.0000 synth f1: 0.5391 performance:  0.54
BinaryAdaBoostClassifier       real f1: 1.0000 synth f1: 0.6296 performance:  0.63
BinaryMLPClassifier            real f1: 1.0000 synth f1: 0.5693 performance:  0.57

Additional context

I am a bit at a loss here as to whether it is OK to compare both models so directly, since the SDV generation process may produce NaNs and infinities that are silently replaced in the evaluation code but may still have an impact.
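
For what it's worth, the amount of affected data could be checked up front (a minimal sketch using pandas/NumPy; new_data is the synthetic sample from the example above):

import numpy as np

# Count NaN/inf cells among the numeric columns of the synthetic sample,
# i.e. the values the evaluation code would otherwise replace silently.
numeric = new_data.select_dtypes(include=np.number)
n_bad = int((~np.isfinite(numeric)).sum().sum())
print(f'{n_bad} of {numeric.size} numeric cells are NaN or infinite ({n_bad / numeric.size:.1%})')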

npatki commented 2 years ago

Hi @marianokamp, nice to meet you. I don't fully follow the request --

Right now the ML efficacy metrics train on the synthetic data and test on the real data. Of course, there may be a baseline difficulty for the problem, so as the user guide suggests, you can baseline this score by training on a portion of the real data (e.g. 75%) and testing on the rest.


score = BinaryDecisionTreeClassifier.compute(real_data, synthetic_data, target='column_name')

# split the real data into 75% train and 25% test groups
real_data_train = real_data.sample(int(len(real_data) * 0.75))
real_data_test = real_data[~real_data.index.isin(real_data_train.index)]

baseline_score = BinaryDecisionTreeClassifier.compute(real_data_test, real_data_train, target='column_name')

Comparing the two scores will tell you how much accuracy you lose by using synthetic data vs real data.
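
If a single relative number is what you're after, it can be derived directly from the two scores (a sketch; variable names follow the snippet above):

# Relative ML efficacy: 1.0 means no loss from training on synthetic data.
relative_performance = score / baseline_score
print(f'synthetic: {score:.4f}  baseline: {baseline_score:.4f}  relative: {relative_performance:.2f}')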

Is your request to turn this comparison into a single metric? Or are you thinking of a different type of train/test setup?

npatki commented 2 years ago

FYI I'm moving this issue into our SDMetrics library for further triage and slightly renaming it.

We'll keep it open and use this issue to track progress.