yoid2000 opened this issue 1 year ago
Hi @yoid2000, I transferred this issue into SDMetrics, as this is the underlying library that implements the metric.
I can replicate this error and will classify this as a bug.
The expectation is that the training data contains all possible values, since this is crucial information for fitting the Linear Regression model. I agree that it should be fine if the test data does not contain all possible category values.
This error seems to be related to #291. It appears that the transformation (preprocessing) is using the wrong dataset to fit.
Observed: The code fits the transformers on the `test_data` and then applies them to the `train_data`. That's why it expects all categories to be present in the test data.
Expected: The code should fit on the `train_data` and then apply the fitted transformers to the `test_data`. We expect all categories to be present during training, but missing categories do not matter during testing, as in the sketch below.
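For illustration, here is a minimal sketch of that order with toy data (the `cat` column and the two splits are made-up placeholders, not SDMetrics internals):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy splits: the training split contains every category,
# the test split only a subset of them.
train_data = pd.DataFrame({'cat': ['a', 'b', 'c', 'a']})
test_data = pd.DataFrame({'cat': ['a', 'b']})

enc = OneHotEncoder()
enc.fit(train_data[['cat']])                 # fit on train_data ...
encoded = enc.transform(test_data[['cat']])  # ... then apply to test_data
# No error: every category in the test split was seen during fitting.
```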
Any updates? I am facing the same issue. Any workaround?
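One possible workaround until a fix lands, sketched as a hypothetical helper (not part of SDMetrics): since the metric currently fits its encoders on the real split, you can drop synthetic rows whose categorical values never occur in the real data before computing the metric.

```python
import pandas as pd

def drop_unseen_categories(synthetic, real, categorical_columns):
    """Drop synthetic rows with categorical values absent from the real data.

    Hypothetical pre-filter: because the metric currently fits its encoders
    on the real split, any category unique to the synthetic split triggers
    the 'Found unknown categories' ValueError.
    """
    mask = pd.Series(True, index=synthetic.index)
    for col in categorical_columns:
        mask &= synthetic[col].isin(real[col].unique())
    return synthetic[mask]
```

Note that this silently removes exactly the rare rows the synthesizer was built to produce, so it only unblocks the computation rather than evaluating those rows.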
Problem Description
I am working with a home-grown synthesizer that is able to synthesize relatively rare categorical values (e.g., a value that occurs maybe 3 or 4 times in a table of thousands of rows).
This is all fine and good, but a problem I have is that when I run a model (say, `sdmetrics.single_table.LinearRegression.compute()`) on the synthetic data, it can occasionally happen that no instances of that value show up in the test data (randomly sampled from the original data), whereas some instances of that value show up in the training data (randomly sampled from the synthesized data). This in turn causes the ML Efficacy measures to fail with a message like this:

`ValueError: Found unknown categories ['fake'] in column 0 during transform`
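For illustration, a minimal sketch that reproduces this failure mode with a plain sklearn encoder (toy data, not SDMetrics internals):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# The encoder is fitted on data that lacks the rare category 'fake' ...
fitted_on = pd.DataFrame({'col': ['real', 'other', 'real']})
enc = OneHotEncoder()  # default handle_unknown='error'
enc.fit(fitted_on)

# ... then asked to transform data that does contain it, which raises:
# ValueError: Found unknown categories ['fake'] in column 0 during transform
enc.transform(pd.DataFrame({'col': ['real', 'fake']}))
```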
This can be avoided by setting `handle_unknown='ignore'` in the sklearn encoders (i.e. `enc = OneHotEncoder(handle_unknown='ignore')` in `def fit(self, data):` in `class HyperTransformer():`). Unfortunately there is no way to set the `handle_unknown` parameter from sdmetrics. As a result, there is no way for me to complete these measures (short of hard-coding the parameter in sklearn itself). I could probably do a try-except around the efficacy measure, but this still doesn't allow the measure itself to complete.
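For illustration, a minimal sketch of how `handle_unknown='ignore'` changes the encoder's behavior (toy data, not SDMetrics internals):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

seen = pd.DataFrame({'col': ['real', 'other']})
unseen = pd.DataFrame({'col': ['real', 'fake']})

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(seen)
# The unseen category 'fake' becomes an all-zero row instead of raising.
print(enc.transform(unseen).toarray())
```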
Expected behavior

Allow the `handle_unknown` flag to be specified in the `model.compute()` calls, either explicitly or through some kind of parameter pass-through to sklearn.
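For concreteness, a hypothetical sketch of what such a pass-through could look like; `encoder_kwargs` is not an existing SDMetrics parameter, and the positional arguments assume the current `compute(test_data, train_data, metadata, target)` signature:

```python
from sdmetrics.single_table import LinearRegression

# Hypothetical API: 'encoder_kwargs' does not exist in SDMetrics today.
score = LinearRegression.compute(
    test_data,
    train_data,
    metadata,
    target='label',
    encoder_kwargs={'handle_unknown': 'ignore'},  # forwarded to the sklearn encoders
)
```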