visualize_as_dataframe(show_only_changes=True) does not work when categorical data is composed of numbers

interpretml / DiCE

Generate Diverse Counterfactual Explanations for any machine learning model.

MIT License

Query instance (original outcome : 1)
['1', 21, '2', '2', 3599, '1', '4', '1', '2', '1', '4', '3', '3', '1', '1', '2', '2', '1', '2', '1', 1]

Diverse Counterfactual set (new outcome: 0.0)
[1, '-', 2, 2, 17507, 1, 4, 1, 2, 1, 4, 3, 3, 1, 1, 2, 2, '2', 2, 1, 0]
[1, '-', '0', 2, '-', 1, '-', 1, 2, 1, 4, 3, 3, 1, 1, 2, 2, 1, 2, 1, 0]

['credit_history', 'foreign_worker', 'housing', 'other_debtors', 'other_installment_plans', 'people_liable', 'personal_status_sex', 'purpose', 'savings', 'status', 'telephone', 'employment_duration', 'installment_rate', 'job', 'number_credits', 'present_residence', 'property']
[[0, 1, 2, 3, 4], [1, 2], [1, 2, 3], [1, 2, 3], [1, 2, 3], [1, 2], [1, 2, 3, 4], [0, 1, 2, 3, 4, 5, 6, 8, 9, 10], [1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2], [1, 2, 3, 4, 5], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
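The output above is consistent with a changes-only view that compares the query and counterfactual values with plain equality, so '1' and 1 count as different. The following is a minimal sketch of that effect, not DiCE's actual implementation: mask_unchanged is a hypothetical helper, and the value 21 in the counterfactual is an assumption (the output only shows '-' in that position).

```python
def mask_unchanged(query, counterfactual, normalize=str):
    """Replace values equal to the query with '-'.

    Hypothetical helper for illustration only; it is not part of DiCE.
    `normalize` is applied to both sides before comparing.
    """
    return [
        "-" if normalize(q) == normalize(c) else c
        for q, c in zip(query, counterfactual)
    ]


# First five features of the query instance and of the first counterfactual.
# Assumption: the '-' printed in the second position stood for an unchanged 21.
query = ['1', 21, '2', '2', 3599]
counterfactual = [1, 21, 2, 2, 17507]

# Plain equality: '1' != 1, so unchanged categorical values are not masked.
strict = mask_unchanged(query, counterfactual, normalize=lambda v: v)
print(strict)   # [1, '-', 2, 2, 17507]

# Comparing string forms masks every value that is really unchanged.
lenient = mask_unchanged(query, counterfactual)
print(lenient)  # ['-', '-', '-', '-', 17507]
```

Comparing string forms is only one possible normalization; casting both sides to the dtype of the training column would serve the same purpose.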

I have a likely related issue with categorical columns that contain integers. Calling Dice.generate_counterfactuals raises:

ValueError: Found unknown categories ['9', '2', '13', '7', '5', '12', '11', '15', '18', '3', '1', '14', '8', '10', '17', '4', '16'] in column 2 during transform

I realised that Data.permitted_range already has the integer values of categorical columns converted to strings, which is probably the root cause of the problem. My dataframe has only number and category dtype columns, and I got it working with:

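import dice_ml
# Treat every numeric column as continuous; the remaining (category) columns are categorical.
data = dice_ml.Data(dataframe=df_train, continuous_features=df_train.select_dtypes("number").columns, outcome_name="y")
# Overwrite the stringified permitted values with the original integer categories.
for col in df_train.select_dtypes("category").columns:
    data.permitted_range[col] = df_train[col].cat.categories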

Edit: This only works for Dice(method="random"), not for "genetic" or "kdtree".

Edit2: The actual culprit may be PublicData._set_feature_dtypes, where each column in categorical_feature_names is converted to str before being converted to category. However, when I tweak the source code and omit the string conversion, I get another error from the genetic algorithm's LabelEncoder, which encodes to int64, which in turn cannot be handled in a NumPy-internal np.isnan check.
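The effect of converting to str before converting to category can be reproduced with pandas alone. This is a small sketch on a made-up column, not the DiCE source; the column name and values are assumptions chosen for illustration.

```python
import pandas as pd

# Made-up integer-coded categorical column (assumption, for illustration only).
df = pd.DataFrame({"housing": [1, 2, 3, 2]})

# Converting to str first yields string categories ...
str_cats = df["housing"].astype(str).astype("category").cat.categories.tolist()
# ... whereas converting directly keeps the integer categories.
int_cats = df["housing"].astype("category").cat.categories.tolist()

print(str_cats)  # ['1', '2', '3']
print(int_cats)  # [1, 2, 3]

# No value is shared between the two sets, so anything fitted on one
# representation will see the other as entirely unknown categories.
overlap = set(str_cats) & set(int_cats)
print(overlap)   # set()
```

An encoder fitted on one representation and applied to the other would then see every value as unknown, which matches the shape of the ValueError quoted above.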

visualize_as_dataframe(show_only_changes=True) does not work when categorical data is composed of numbers #384