Open hadjipantelis opened 2 years ago
Thanks for reporting this, @hadjipantelis. Let me have a look and try to reproduce it. The correct behavior is to return no CFs in case features_to_vary cannot lead to a CF.
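In other words, when the frozen features make a CF impossible, I would expect a check along these lines to hold (a rough sketch against the MWE below; treating final_cfs_df being None or empty as "no CFs" is my assumption):

```python
# Rough sketch of the expected behaviour: if features_to_vary cannot lead to a
# valid counterfactual, no CFs should be returned for that query instance.
# (Assumption: final_cfs_df is None or empty in that case.)
for example in explanations.cf_examples_list:
    assert example.final_cfs_df is None or example.final_cfs_df.empty
```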
Thank you for looking into this. For the record, I tried with scikit-learn 0.24.2 in case that was one of the culprits, and I got the same behaviour.
@amit-sharma Hello Amit, is there an update on this, please? I tried it with version 0.8 and the issue remains.
Hi, I have a similar issue with a regressor; here is the MWE:
```python
import os
import random
from urllib.request import urlretrieve

import dice_ml
from lightgbm import LGBMRegressor
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def diabetes_df():
    url = "https://www4.stat.ncsu.edu/~boos/var.select/diabetes.tab.txt"
    # safety measure for MacOS, see
    # https://docs.python.org/3/library/urllib.request.html#module-urllib.request
    os.environ["no_proxy"] = "*"
    file_name, _ = urlretrieve(url)
    df = pd.read_csv(file_name, sep="\t").astype({"SEX": str}).astype({"SEX": "category"})
    return df.sample(200, random_state=1)


def data_and_model(df, numerical, categorical, target_column):
    np.random.seed(1)
    numeric_transformer = Pipeline(steps=[("scaler", StandardScaler())])
    categorical_transformer = Pipeline(steps=[("onehot", OneHotEncoder(handle_unknown="ignore"))])
    transformations = ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, numerical),
            ("cat", categorical_transformer, categorical),
        ]
    )

    X = df.drop(target_column, axis=1)
    y = df[target_column]
    clf = Pipeline(steps=[("preprocessor", transformations), ("regressor", LGBMRegressor())])
    model = clf.fit(X, y)
    return X, y, model


# Data set
df = diabetes_df()
numerical = ["AGE", "BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]
categorical = ["SEX"]
x_train, y_train, model = data_and_model(df, numerical, categorical, "Y")
factuals = x_train[0:2]

seed = 5
random.seed(seed)
np.random.seed(seed)

# Ask for counterfactual explanations
df_for_dice = pd.concat([x_train, y_train], axis=1)
dice_data = dice_ml.Data(dataframe=df_for_dice, continuous_features=numerical, outcome_name="Y")
dice_model = dice_ml.Model(model=model, backend="sklearn", model_type="regressor")
dice_explainer = dice_ml.Dice(dice_data, dice_model, method="genetic")

features_to_vary = ["BMI", "BP", "S1", "S2", "S3", "S4", "S5", "S6"]
explanations = dice_explainer.generate_counterfactuals(
    factuals,
    total_CFs=5,
    desired_range=[60, 90],
    features_to_vary=features_to_vary,
    posthoc_sparsity_algorithm="binary",
)

for example in explanations.cf_examples_list:
    print("+" * 70)
    print(example.test_instance_df)
    print("-" * 70)
    print(example.final_cfs_df)
    print("-" * 70)
```
Column AGE is changed in the counterfactual explanations for the second factual, even though it is not in features_to_vary.
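A quick way to check this programmatically, as a sketch reusing the names from the MWE above (frozen is just the complement of features_to_vary):

```python
# Sketch: flag any "frozen" feature (not in features_to_vary) that changed in a CF
frozen = [c for c in x_train.columns if c not in features_to_vary]  # AGE, SEX
for example in explanations.cf_examples_list:
    if example.final_cfs_df is None:
        continue
    original = example.test_instance_df[frozen].iloc[0]
    for _, cf in example.final_cfs_df[frozen].iterrows():
        changed = [c for c in frozen if cf[c] != original[c]]
        if changed:
            print("frozen features changed:", changed)
```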
DiCE seems awesome. Thank you for your work on it!
I am trying to use DiCE with XGBoost/LightGBM but I am getting some unexpected behaviour. First and foremost, DiCE seems to "partially ignore" the list of features to vary. In the example below, generate_counterfactuals consistently changes a feature that is not on the list. I have the suspicion that DiCE does that because there are no easy counterfactuals to find. Thanks again for your work on DiCE and let me know if further clarifications are required.
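For reference, the shape of the call I am making is roughly the following (a sketch only; the names, the feature subset, and desired_class are placeholders rather than my actual example):

```python
# Sketch of the classifier-style call (placeholder names, not my actual example)
import dice_ml

d = dice_ml.Data(dataframe=train_df, continuous_features=numerical, outcome_name="target")
m = dice_ml.Model(model=fitted_pipeline, backend="sklearn")  # XGBoost/LightGBM inside an sklearn Pipeline
exp = dice_ml.Dice(d, m, method="random")
cfs = exp.generate_counterfactuals(
    query_instances,                 # a few factual rows, without the target column
    total_CFs=5,
    desired_class="opposite",        # classifier analogue of desired_range
    features_to_vary=["BMI", "BP"],  # only these should be allowed to change
)
```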
P.S.0: In both of the examples above, I also find visualize_as_dataframe to consistently fail if we set method='kdtree' or 'genetic' when we instantiate the DiCE class. I am less bothered by that at the moment as 'random' works "fine". I am mentioning it as something else that also fails and maybe is helpful when debugging.
P.S.1: I have noticed similar behaviour (changing features it shouldn't) with RandomForestClassifier too.
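For completeness, the failing pattern from P.S.0 looks roughly like this (a sketch that reuses the names from the regressor MWE earlier in the thread purely for illustration):

```python
# Sketch of the call that fails for me once method is "kdtree" or "genetic"
exp_kdtree = dice_ml.Dice(dice_data, dice_model, method="kdtree")
explanations = exp_kdtree.generate_counterfactuals(
    factuals, total_CFs=5, desired_range=[60, 90], features_to_vary=features_to_vary
)
explanations.visualize_as_dataframe(show_only_changes=True)
```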