Categorical have to be strings

londumas commented 3 years ago

When a dataset contains categorical features whose values are not strings, the following call raises a ValueError.

# generate counterfactuals
dice_exp_genetic = exp_genetic.generate_counterfactuals(query_instances,
    total_CFs=4, desired_class=desired_class)

ValueError                                Traceback (most recent call last)
<ipython-input-28-011c74faad5b> in <module>
      1 # generate counterfactuals
      2 dice_exp_genetic = exp_genetic.generate_counterfactuals(query_instances,
----> 3     total_CFs=4, desired_class=desired_class)

~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/explainer_base.py in generate_counterfactuals(self, query_instances, total_CFs, desired_class, desired_range, permitted_range, features_to_vary, stopping_threshold, posthoc_sparsity_param, posthoc_sparsity_algorithm, verbose, **kwargs)
    100                 posthoc_sparsity_algorithm=posthoc_sparsity_algorithm,
    101                 verbose=verbose,
--> 102                 **kwargs)
    103             cf_examples_arr.append(res)
    104         return CounterfactualExplanations(cf_examples_list=cf_examples_arr)

~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/dice_genetic.py in _generate_counterfactuals(self, query_instance, total_CFs, initialization, desired_range, desired_class, proximity_weight, sparsity_weight, diversity_weight, categorical_penalty, algorithm, features_to_vary, permitted_range, yloss_type, diversity_loss_type, feature_weights, stopping_threshold, posthoc_sparsity_param, posthoc_sparsity_algorithm, maxiterations, thresh, verbose)
    269         query_instance_orig = query_instance
    270         query_instance = self.data_interface.prepare_query_instance(query_instance=query_instance)
--> 271         query_instance = self.label_encode(query_instance)
    272         query_instance = np.array(query_instance.values[0])
    273         self.x1 = query_instance

~/.local/lib/python3.7/site-packages/dice_ml/explainer_interfaces/dice_genetic.py in label_encode(self, input_instance)
    524     def label_encode(self, input_instance):
    525         for column in self.data_interface.categorical_feature_names:
--> 526             input_instance[column] = self.labelencoder[column].transform(input_instance[column])
    527         return input_instance
    528 

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in transform(self, y)
    275             return np.array([])
    276 
--> 277         _, y = _encode(y, uniques=self.classes_, encode=True)
    278         return y
    279 

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in _encode(values, uniques, encode, check_unknown)
    120     else:
    121         return _encode_numpy(values, uniques, encode,
--> 122                              check_unknown=check_unknown)
    123 
    124 

/usr/local/lib/python3.7/dist-packages/sklearn/preprocessing/_label.py in _encode_numpy(values, uniques, encode, check_unknown)
     49             if diff:
     50                 raise ValueError("y contains previously unseen labels: %s"
---> 51                                  % str(diff))
     52         encoded = np.searchsorted(uniques, values)
     53         return uniques, encoded

ValueError: y contains previously unseen labels: [0]

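A workaround is to convert the affected columns to strings, where lst holds the names of the columns to convert (here, the categorical ones):

    # cast each listed column to str so its values match string labels
    for c in lst:
        df[c] = df[c].astype(str)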
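As a slightly fuller sketch of the same workaround, the helper below casts a given list of columns to strings on a copy of the DataFrame. The helper name stringify_categoricals and the toy columns age and gender are hypothetical and only serve as illustration. It seems reasonable to apply the same conversion to query_instances as well, since the traceback fails while encoding the query instance, but that is an inference from the traceback rather than documented behaviour.

```python
import pandas as pd


def stringify_categoricals(df, categorical_columns):
    """Return a copy of df with the given columns cast to str.

    Hypothetical helper for illustration; not part of DiCE.
    """
    out = df.copy()
    for column in categorical_columns:
        out[column] = out[column].astype(str)
    return out


# Toy data: 'gender' is categorical but stored as integers.
df = pd.DataFrame({"age": [25, 40], "gender": [0, 1]})
clean = stringify_categoricals(df, ["gender"])

# The original frame is untouched; only the copy holds string labels.
gender_values = clean["gender"].tolist()
```

Working on a copy keeps the original integer-coded frame available in case the model itself was trained on it.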

The bug can be reproduced with the Jupyter notebook https://github.com/interpretml/DiCE/blob/master/docs/source/notebooks/DiCE_model_agnostic_CFs.ipynb by replacing the values of the binary feature gender with integers:

gender = dataset['gender'].to_numpy()
gender[gender=='Male'] = '0'
gender[gender=='Female'] = '1'
dataset['gender'] = gender.astype(int)
dataset.head()

amit-sharma commented 3 years ago

Thanks. The error comes from the use of sklearn's LabelEncoder, which expects a string. Having categoricals as numeric values is possible, but it raises the risk of confusion if a user does not explicitly provide the data type and actually wanted the column to be treated as numerical.

Therefore, it might be safer to pre-process the categorical columns to be non-numeric before passing them to DiCE. That said, categorical variables can sometimes be integers. We will look to support this in a future release.
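To illustrate the kind of mismatch behind the traceback, here is a minimal pure-Python sketch. It assumes the encoder was fitted on string labels while the query holds the integer 0. TinyLabelEncoder is a hypothetical, simplified stand-in; it is neither sklearn's LabelEncoder nor DiCE's code, and the exact behaviour of the real encoder may vary between sklearn versions.

```python
class TinyLabelEncoder:
    """Hypothetical, simplified label encoder used only for illustration."""

    def fit(self, values):
        # Remember each distinct label and its integer position.
        self.classes_ = sorted(set(values))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, values):
        # Any value that was not seen during fit is rejected.
        unseen = [v for v in values if v not in self._index]
        if unseen:
            raise ValueError(f"y contains previously unseen labels: {unseen}")
        return [self._index[v] for v in values]


# Fitted on string labels, as assumed above.
encoder = TinyLabelEncoder().fit(["0", "1", "0"])

# String queries match the fitted classes.
encoded_ok = encoder.transform(["0", "1"])

# An integer query does not: 0 is not the same label as "0".
try:
    encoder.transform([0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

The integer 0 and the string "0" are different labels, so a lookup that succeeds for one fails for the other. Casting the column to str on both the training data and the query removes the mismatch.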

NikkiRoodenrijs commented 2 years ago

Hi!!

I got the same error when running exp_genetic.generate_counterfactuals. However, when I use exp_random.generate_counterfactuals, I don't get this error. Can you explain why this error is only raised by exp_genetic.generate_counterfactuals?

Furthermore, I tried to fix the error using the suggestions from @amit-sharma and @londumas, but still couldn't get the function to run. Could you provide a more detailed solution?

Thank you!!

Categorical have to be strings #222