interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License

Bug: Pandas DataFrames columns names not verified at prediction time #544

Open cesar-perceiv opened 3 weeks ago

cesar-perceiv commented 3 weeks ago

Description of the problem

When training an EBC using a pandas DataFrame as input, we expect column names to matter when calling predict_proba. Currently, it seems that EBCs don't care about the column names of the features, only about their number. This leads to the issue that if we pass a pandas DataFrame with different columns than what was seen during training, the EBC will still produce predictions.

The following code runs, but should rather raise a KeyError:

import pandas as pd
from interpret.glassbox import ExplainableBoostingClassifier

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.6, 0.5, -0.1, 0.2],
})
y = pd.Series([0, 0, 1, 1])

ebc = ExplainableBoostingClassifier()
ebc.fit(X, y)

# This should raise a KeyError, as column "a" is missing from the DataFrame being passed!
probas = ebc.predict_proba(X[["b", "b"]])

Potential Solution?

In scikit-learn, this verification is performed by _check_feature_names, which is called from _validate_data. As these are core methods of the BaseEstimator class, from which EBMModel inherits, maybe the simplest way to fix this issue would be to add a step that validates X somewhere, e.g. in _predict_score:


    def _predict_score(self, X, init_score=None):
        check_is_fitted(self, "has_fitted_")
        X = self._validate_data(X)
        ...
paulbkoch commented 3 weeks ago

Hi @cesar-perceiv -- I believe this is not correct. EBMs should handle the scenario where columns are re-arranged. If you instead used:

probas = ebc.predict_proba(X[["b", "a"]])

Then I think you would find that the predictions work and use the swapped positions. However, there is a fallback when the column names do not match, in which case the column positions are used. In the example above, since "a" did not exist, predict_proba knew it could not use column names, so it fell back to positional ordering.
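Roughly, the behavior looks like this (continuing from the snippet in your issue description; the renamed columns are just for illustration):

import numpy as np

# Columns re-arranged but names preserved: matched by name, same predictions.
probas_original = ebc.predict_proba(X)                # columns ["a", "b"]
probas_reordered = ebc.predict_proba(X[["b", "a"]])   # matched by name
assert np.allclose(probas_original, probas_reordered)

# Column names never seen at fit time: falls back to positional ordering,
# so "x1" is silently treated as "a" and "x2" as "b".
X_renamed = X.rename(columns={"a": "x1", "b": "x2"})
probas_fallback = ebc.predict_proba(X_renamed)
assert np.allclose(probas_original, probas_fallback)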

I realize we're being more permissive than scikit-learn in this scenario. If we assume that the caller's code is correct, then there shouldn't be an issue since we only use positional ordering if ordering by name fails and, in that scenario, the only alternative would be to raise an exception. The drawback though is that it makes it more likely that an error will silently happen. I think in most cases though when this happens the predictions will be so bad that the caller will notice the issue that way.

What are your thoughts on this design now that I've outlined it above?

As an interesting side note, we also handle the scenario where there are columns in X that are not used in the EBM. If you instead had probas = ebc.predict_proba(X[["q", "b", "m", "a", "z"]]), it would use only columns "b" and "a". This might be useful for anyone using the ebm.remove_features(...) function.
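Something like this, again continuing from your snippet (the extra columns "q" and "z" below are just made-up placeholders the model was never trained on):

# Hypothetical extra columns that the EBM has never seen.
X_extra = X.assign(q=0.0, z=1.0)

# Only "b" and "a" are picked out by name; "q" and "z" are ignored.
probas = ebc.predict_proba(X_extra[["q", "b", "a", "z"]])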

cesar-perceiv commented 3 weeks ago

However, there is a fallback if the column names do not match, and in that case it uses the positions of columns.

I think this fallback is very error prone and, as you mentioned, it allows errors to happen silently. I think we should try to prevent this kind of silent error as much as possible, as they can be extremely pernicious and could go unnoticed in critical scenarios.

One of the big advantages of using pandas DataFrames over numpy arrays in scikit-learn and InterpretML is precisely being able to check that the features passed as input are indeed the right ones. So since InterpretML implements an interface very similar to scikit-learn's, I think it should behave like it does.

If anyone wants to use the old fallback, they can simply call to_numpy() on the input DataFrame:

probas = ebc.predict_proba(X.to_numpy())

Regarding your side note, here is how scikit-learn handles it:

import pandas as pd
from sklearn.linear_model import LogisticRegression

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.6, 0.5, -0.1, 0.2],
    "c": [0.6, 0.5, -0.1, 0.2],
})
y = pd.Series([0, 0, 1, 1])

model = LogisticRegression()
model.fit(X[["a", "b"]], y)

probas = model.predict_proba(X)
# ValueError: The feature names should match those that were passed during fit.
# Feature names unseen at fit time:
# - c

This case is also handled by the _validate_data function of the BaseEstimator, so relying on it would be the best way forward in my opinion.

paulbkoch commented 2 weeks ago

Thanks @cesar-perceiv. I think you're right that it would be better to raise an exception in the first case, where we're currently falling back to positional ordering. to_numpy() doesn't quite handle all scenarios since there aren't great options for expressing things like pd.CategoricalDtype in numpy, but an easy alternative would be to set the column names to match with:

X.columns = ebm.feature_names_in_
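For example (the new DataFrame below is just illustrative, and I'm reusing the ebc from the issue snippet):

# Data arriving without meaningful column names, assigned the training-time
# names so that name-based matching succeeds.
X_new = pd.DataFrame([[0.3, 2.5], [1.1, -0.4]])
X_new.columns = ebc.feature_names_in_   # ["a", "b"] from the earlier snippet
probas = ebc.predict_proba(X_new)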

On the 2nd item about raising an exception in predict_proba if the dataframe has a superset of the features in the EBM, what kind of user error would this be useful to catch? I think that scenario has more utility to our users given that we want to promote model editing, and less downside to allow.

I want to someday remove scikit-learn from our dependencies (something scikit-learn allows via duck typing), so I'd probably make this change by modifying the InterpretML validation code instead of calling their _validate_data function. I mention this because it means we can depart slightly from scikit-learn's validation if the benefits outweigh the costs for some scenario.
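Roughly, a check like this could live in our own validation code (the helper name and message are just a sketch, not existing InterpretML code):

def _check_feature_names(X, feature_names_in_):
    # Only DataFrame-like inputs carry names; numpy arrays stay positional.
    if hasattr(X, "columns"):
        passed = [str(c) for c in X.columns]
        missing = [name for name in feature_names_in_ if name not in passed]
        if missing:
            raise KeyError(
                f"Columns seen at fit time but missing at predict time: {missing}"
            )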

cesar-perceiv commented 2 weeks ago

@paulbkoch thanks for your fast answers!

I think you're right that it would be better to raise an exception in the first case where we're currently falling back to positional ordering.

Agreed!

to_numpy() doesn't quite handle all scenarios since there aren't great options for expressing things like pd.CategoricalDtype in numpy

Ah, yes you are right!

On the 2nd item about raising an exception in predict_proba if the dataframe has a superset of the features in the EBM, what kind of user error would this be useful to catch? I think that scenario has more utility to our users given that we want to promote model editing, and less downside to allow.

Personally, I like to think that every feature I feed into my model is used for predictions. After all, that's what happens when relying on numpy arrays. So if I provide a pandas DataFrame with columns [a, b, c] to my model, I expect that model to use those [a, b, c] columns for prediction. If the model uses only [a, b], I think it is useful to at least know about it. Maybe the model with fewer features was deployed to production? But as you mentioned, it might become particularly handy when editing models.

This does not necessarily need to be a ValueError as in scikit-learn (especially if you want to move away from it), but maybe a Warning would be nice? Or this could be configured through a flag somewhere, e.g.:

on_unused_inputs = "warn"  # or "raise" or "ignore"
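Something roughly like this (the flag name, helper, and message are all just a sketch, not an existing API):

import warnings

def _handle_unused_inputs(passed_names, feature_names_in_, on_unused_inputs="warn"):
    # Columns supplied by the caller that the fitted model does not use.
    unused = [name for name in passed_names if name not in feature_names_in_]
    if unused and on_unused_inputs == "warn":
        warnings.warn(f"Columns not used by the model: {unused}")
    elif unused and on_unused_inputs == "raise":
        raise ValueError(f"Columns not used by the model: {unused}")
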
paulbkoch commented 2 weeks ago

Thanks @cesar-perceiv, I like the warning. It doesn't slow anyone down while messing around, and should hopefully be logged or checked in production. At least it provides a hint of what might be wrong if other metrics indicate an issue. If we get enough complaints about the scenario where toy models get mistakenly deployed to production, we can revisit upgrading it to an exception.

cesar-perceiv commented 2 weeks ago

@paulbkoch sounds good to me!