interpretml / DiCE

Generate Diverse Counterfactual Explanations for any machine learning model.
https://interpretml.github.io/DiCE/
MIT License

Update explainer_base.py #424

Open praveenjune17 opened 11 months ago

praveenjune17 commented 11 months ago

**Issue context**

I'm getting `ValueError: ('Feature', {}, 'has a value outside the dataset.')` when trying to generate counterfactuals with the following combination:

- `dice_ml.Data` built from metadata properties for each feature
- `algorithm = genetic`
- `query_size > 1`
- `permitted_range = None`

Why does the code fail for this combination? It turns out the categorical feature values in the query instance are not label encoded, while the values in `features_to_vary` are label encoded. This mismatch causes the code to fail with the `ValueError`. It happens only with the 'genetic' method, and only when `permitted_range` is not supplied.
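To make the mismatch concrete, here is a minimal sketch (plain Python, not DiCE internals): a label encoding maps each category to an integer, so a raw category string from the query instance can never match a set of encoded values.

```python
# Illustrative sketch of the mismatch (not DiCE's actual code): a simple
# label encoding maps each category to its index in the sorted list.
categories = ['Government', 'Other/Unknown', 'Private', 'Self-Employed']
encoding = {cat: idx for idx, cat in enumerate(sorted(categories))}

# The permitted values end up label encoded ...
permitted_encoded = set(encoding.values())         # {0, 1, 2, 3}
# ... while the query instance still holds the raw string.
query_value = 'Private'

# The raw string is never a member of the encoded set, which is the
# kind of mismatch behind "... has a value outside the dataset."
print(query_value in permitted_encoded)            # False
print(encoding[query_value] in permitted_encoded)  # True
```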

Code to recreate the issue.

```python
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

import dice_ml
from dice_ml.utils import helpers  # helper functions

dataset = helpers.load_adult_income_dataset()
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(
    dataset, target, test_size=0.2, random_state=0, stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)

d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters',
                                         'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales',
                                          'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')

numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[('cat', categorical_transformer, categorical)])

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)

# Set the number of data points required in the query set.
data_point = 2
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="genetic")

# Query instances as a DataFrame; keys: feature name, values: feature value.
query_instance = pd.DataFrame({'age': [22] * data_point,
                               'workclass': ['Private'] * data_point,
                               'education': ['HS-grad'] * data_point,
                               'marital_status': ['Single'] * data_point,
                               'occupation': ['Service'] * data_point,
                               'race': ['White'] * data_point,
                               'gender': ['Female'] * data_point,
                               'hours_per_week': [45] * data_point},
                              index=list(range(data_point)))

# Generate counterfactuals.
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4,
                                        desired_class="opposite",
                                        initialization="random")

# Visualize the results.
dice_exp.visualize_as_dataframe(show_only_changes=True)
```

**Proposed fix**

This fix makes sure `get_features_range(permitted_range)` is executed whether or not `permitted_range` is supplied.
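The shape of the change can be sketched as follows. This is a hypothetical illustration: `FakeDataInterface` and `setup_feature_ranges` are stand-ins invented here, not names from DiCE's `explainer_base.py`; only the idea of calling `get_features_range` unconditionally comes from the fix description.

```python
# Hypothetical sketch of the proposed fix (names and structure assumed,
# not copied from DiCE's explainer_base.py).

class FakeDataInterface:
    """Stand-in for dice_ml's data interface, for illustration only."""
    def get_features_range(self, permitted_range=None):
        # When no permitted_range is given, fall back to ranges derived
        # from the dataset metadata.
        defaults = {'age': [17, 90], 'hours_per_week': [1, 99]}
        return dict(defaults, **(permitted_range or {}))

def setup_feature_ranges(data, permitted_range=None):
    # Before the fix, get_features_range ran only when permitted_range
    # was supplied; the fix calls it unconditionally, so the genetic
    # method always works with consistently derived feature ranges.
    return data.get_features_range(permitted_range)

ranges = setup_feature_ranges(FakeDataInterface())  # permitted_range=None
print(ranges['age'])  # [17, 90]
```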

praveenjune17 commented 10 months ago

@gaugup, please review the test cases.