sdv-dev / SDMetrics

Metrics to evaluate quality and efficacy of synthetic datasets.
https://docs.sdv.dev/sdmetrics
MIT License

sklearn throws ValueError exception #333

Open yoid2000 opened 1 year ago

yoid2000 commented 1 year ago

Problem Description

I am working with a home-grown synthesizer that is able to synthesize relatively rare categorical values (e.g., values that occur only 3 or 4 times in a table of thousands of rows).

This works well, but when I run an ML efficacy metric (say, sdmetrics.single_table.LinearRegression.compute()) on the synthetic data, it can occasionally happen that no instances of such a value appear in the test data (randomly sampled from the original data), while some instances do appear in the training data (randomly sampled from the synthesized data).

This in turn causes the ML Efficacy measures to fail with a message like this:

ValueError: Found unknown categories ['fake'] in column 0 during transform
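For reference, the failure can be reproduced with sklearn alone (a minimal sketch, not the SDMetrics code itself): an encoder fitted on data that lacks a category raises this error as soon as it later encounters that category.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

fit_data = pd.DataFrame({'col': ['a', 'b', 'a', 'b']})      # rare category missing here
transform_data = pd.DataFrame({'col': ['a', 'b', 'fake']})  # rare category present here

enc = OneHotEncoder()           # default handle_unknown='error'
enc.fit(fit_data)
enc.transform(transform_data)   # ValueError: Found unknown categories ['fake'] in column 0 during transform
```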

This can be avoided by setting handle_unknown='ignore' on the sklearn encoders (i.e., using enc = OneHotEncoder(handle_unknown='ignore') inside HyperTransformer.fit()).
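A minimal sketch of that encoder-level behavior, assuming the HyperTransformer wraps sklearn's OneHotEncoder as described above:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')  # unknown categories encode as all-zeros instead of raising
enc.fit([['a'], ['b']])
enc.transform([['fake']]).toarray()           # array([[0., 0.]]) -- no ValueError
```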

Unfortunately, there is no way to set the handle_unknown parameter from sdmetrics. As a result, I cannot complete these measures (short of hard-coding the parameter in sklearn itself). I could wrap the efficacy measure in a try-except, but that still doesn't allow the measure itself to complete.

Expected behavior

Allow the handle_unknown flag to be specified in the model.compute() calls, either explicitly or through some kind of parameter pass-through to sklearn.

npatki commented 1 year ago

Hi @yoid2000, I transferred this issue into SDMetrics as this is the underlying library that implements the metric.

I can replicate this error and will classify this as a bug.

The expectation is that the training data does contain all possible values, since this is crucial information for forming the Linear Regression model. I agree that it should be ok if the test data does not contain all possible category values.

Root Cause

This error seems to be related to #291. It appears that the transformation (preprocessing) is using the wrong dataset to fit.

Observed: The code is fitting the transformers on the test_data and then applying them to the train_data. That's why it's expecting all categories to be present in the test data.

Expected: The code should fit on the train_data and then apply it to the test_data. We expect all categories to be present during training, but it should not matter if some are missing during test.
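For illustration, a minimal sklearn-only sketch of the expected ordering (not the actual SDMetrics code): fitting on the training data and transforming the test data succeeds even when the rare category is absent from the test set.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_data = pd.DataFrame({'col': ['a', 'b', 'fake']})  # training data contains every category
test_data = pd.DataFrame({'col': ['a', 'b']})           # rare category absent from the test data

enc = OneHotEncoder()
enc.fit(train_data)        # fit on the training data...
enc.transform(test_data)   # ...then transform the test data: no ValueError
```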

iamamiramine commented 6 months ago

Any updates? I am facing the same issue. Any workaround?