marianokamp opened this issue 3 years ago
Hi @marianokamp, nice to meet you. I don't fully follow the request --
Right now the ML efficacy metrics train on the synthetic data and test on the real data. Of course, there may be a baseline difficulty for the problem, so as the user guide suggests, you can baseline this score by training on a portion of the real data (e.g. 75%) and testing on the rest.
from sdmetrics.single_table import BinaryDecisionTreeClassifier

# ML efficacy: train on the synthetic data, test on the real data
score = BinaryDecisionTreeClassifier.compute(real_data, synthetic_data, target='column_name')
# baseline: split the real data into 75% train and 25% test groups
real_data_train = real_data.sample(int(len(real_data) * 0.75))
real_data_test = real_data[~real_data.index.isin(real_data_train.index)]
# train on the real train split, test on the real test split
baseline_score = BinaryDecisionTreeClassifier.compute(real_data_test, real_data_train, target='column_name')
Comparing the two scores will tell you how much accuracy you lose by using synthetic data vs real data.
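For example, once both scores are available they could be collapsed into a single relative number. This is just an illustration of one possible definition, not an existing SDMetrics API:

```python
# Illustrative only: one way to express the gap between the two scores.
# 'score' and 'baseline_score' come from the snippet above.
relative_performance = score / baseline_score   # close to 1.0 means little is lost
performance_drop = baseline_score - score       # absolute loss in metric units
print(f"relative: {relative_performance:.3f}, drop: {performance_drop:.3f}")
```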
Is your request to turn this comparison into a single metric? Or are you thinking of a different type of train/test setup?
FYI I'm moving this issue into our SDMetrics library for further triage and slightly renaming it.
We'll keep it open and use this issue to track progress.
Problem Description
Right now the ML efficacy metrics are computed based on real data vs. synthetic data. While this information is perfect for gauging whether a good-enough model can be fit at all, it would also be interesting to learn how much performance we lose because of synthesizing the data.
Expected behavior
Not sure how to best integrate it with the other metrics. Maybe as additional return values? To stay backwards compatible, they could be returned only when the caller adds the arg compute_relative_performance=True to compute?
Anyhow the result I'd like to see is something like this:
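As an illustration only: the compute_relative_performance argument below is hypothetical, not something SDMetrics offers today, and the names are placeholders.

```python
from sdmetrics.single_table import BinaryDecisionTreeClassifier

# Hypothetical flag -- sketched here only to make the request concrete.
result = BinaryDecisionTreeClassifier.compute(
    real_data,
    synthetic_data,
    target='column_name',
    compute_relative_performance=True,
)
```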
Resulting in:
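Again purely illustrative, with made-up numbers; the exact shape and names of the return value are open for discussion:

```python
# Hypothetical return value: the usual synthetic-vs-real score plus the
# real-vs-real baseline and their ratio.
{
    'score': 0.83,                 # trained on synthetic, tested on real
    'baseline_score': 0.91,        # trained and tested on real data
    'relative_performance': 0.91,  # score / baseline_score
}
```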
Additional context
I am a bit at a loss here as to whether it is OK to compare both models so directly, as the SDV generation process may produce NaNs and infinities that are silently replaced in the evaluation code but may still have an impact.
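One rough way to quantify that concern before trusting the comparison (not part of SDMetrics, just a sanity-check sketch assuming synthetic_data is a pandas DataFrame as above):

```python
import numpy as np

# Count NaNs and infinities in the numeric columns of the synthetic data,
# since the evaluation would otherwise replace them silently.
numeric = synthetic_data.select_dtypes(include=[np.number])
n_nan = int(numeric.isna().sum().sum())
n_inf = int(np.isinf(numeric.to_numpy()).sum())
print(f"{n_nan} NaNs and {n_inf} infinities out of {numeric.size} numeric values")
```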