Issue context
I'm getting "ValueError: ('Feature', {}, 'has a value outside the dataset.')" when trying to generate counterfactuals by setting
dice_ml.Data = metadata properties for each feature
algorithm = genetic
query_size > 1
permitted_range = None
Why does the code fail for this combination?
It turns out that the categorical feature values in the query instance are not label encoded, while the values tracked for features_to_vary are label encoded. This mismatch makes the code fail with the ValueError. It happens only with the 'genetic' method, and only when permitted_range is not supplied.
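The mismatch can be shown in isolation with scikit-learn's LabelEncoder; this is only a hypothetical illustration of the failure mode, not the actual dice_ml code path:

from sklearn.preprocessing import LabelEncoder

# Hypothetical, self-contained illustration; it does not call dice_ml internals.
workclass_values = ['Government', 'Other/Unknown', 'Private', 'Self-Employed']
encoder = LabelEncoder().fit(workclass_values)

# The feasible values tracked for the feature are label encoded ...
encoded_feasible_values = set(encoder.transform(workclass_values))  # {0, 1, 2, 3}

# ... but the query instance still carries the raw string.
query_value = 'Private'

# Comparing a raw value against encoded values makes a perfectly valid value look
# "outside the dataset", which is what the ValueError reports.
print(query_value in encoded_feasible_values)  # False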
Code to recreate the issue.
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import dice_ml
from dice_ml.utils import helpers # helper functions
dataset = helpers.load_adult_income_dataset()
target = dataset["income"]
train_dataset, test_dataset, y_train, y_test = train_test_split(dataset, target, test_size=0.2,
                                                                random_state=0, stratify=target)
x_train = train_dataset.drop('income', axis=1)
x_test = test_dataset.drop('income', axis=1)
d = dice_ml.Data(features={'age': [17, 90],
                           'workclass': ['Government', 'Other/Unknown', 'Private', 'Self-Employed'],
                           'education': ['Assoc', 'Bachelors', 'Doctorate', 'HS-grad', 'Masters', 'Prof-school', 'School', 'Some-college'],
                           'marital_status': ['Divorced', 'Married', 'Separated', 'Single', 'Widowed'],
                           'occupation': ['Blue-Collar', 'Other/Unknown', 'Professional', 'Sales', 'Service', 'White-Collar'],
                           'race': ['Other', 'White'],
                           'gender': ['Female', 'Male'],
                           'hours_per_week': [1, 99]},
                 outcome_name='income')
numerical = ["age", "hours_per_week"]
categorical = x_train.columns.difference(numerical)
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
transformations = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, categorical)])
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', transformations),
                      ('classifier', RandomForestClassifier())])
model = clf.fit(x_train, y_train)
# Set the number of data points required in the query set
data_point = 2
m = dice_ml.Model(model=model, backend="sklearn")
exp = dice_ml.Dice(d, m, method="genetic")
# query instances in the form of a DataFrame; columns: feature names, rows: feature values
query_instance = pd.DataFrame({'age': [22]*data_point,
                               'workclass': ['Private']*data_point,
                               'education': ['HS-grad']*data_point,
                               'marital_status': ['Single']*data_point,
                               'occupation': ['Service']*data_point,
                               'race': ['White']*data_point,
                               'gender': ['Female']*data_point,
                               'hours_per_week': [45]*data_point},
                              index=list(range(data_point)))
# generate counterfactuals
dice_exp = exp.generate_counterfactuals(query_instance, total_CFs=4, desired_class="opposite", initialization="random")
# visualize the results
dice_exp.visualize_as_dataframe(show_only_changes=True)
Proposed fix
The fix makes sure "get_features_range(permitted_range)" is executed whether or not permitted_range is supplied.
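A minimal sketch of the intent, assuming the genetic explainer currently builds the feasible ranges only inside a check on permitted_range; the method name and the self.data_interface / self.feature_range attributes are assumptions for illustration, not the exact dice_ml source:

# Illustrative sketch only; not the exact dice_ml code.
def setup_feature_ranges(self, permitted_range):
    # Assumed current behaviour: ranges are only built when permitted_range is given,
    # so with permitted_range=None the raw query values are never reconciled with the
    # label-encoded feasible values and the ValueError above is raised.
    #
    #   if permitted_range is not None:
    #       self.feature_range = self.data_interface.get_features_range(permitted_range)
    #
    # Proposed behaviour: always call get_features_range, passing permitted_range through
    # (even if it is None), so consistently encoded ranges are built in both cases.
    self.feature_range = self.data_interface.get_features_range(permitted_range)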