marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier
BSD 2-Clause "Simplified" License

Missing values in training data cause a "Domain error in arguments" error during explanation #572

Closed kennysong closed 3 years ago

kennysong commented 3 years ago

(There is a solution to this problem, see below.)

Problem

Let's say I create a LimeTabularExplainer as follows:

explainer = lime.lime_tabular.LimeTabularExplainer(x_train.to_numpy(), feature_names=FEATURES, class_names=CLASS_NAMES, categorical_features=CATEGORICAL_FEATURES_IDX, discretize_continuous=True)

If x_train contains missing values (i.e. NaNs) in the continuous features, the discretization code will output NaNs, which causes an explanation to fail:

exp = explainer.explain_instance(x_positive, predict_fn, num_features=100)

With the stack trace:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-46-f4af7a6e6863> in <module>
----> 1 exp = explainer.explain_instance(x_positive, predict_fn, num_features=100)

~/Downloads/scratchpad/env/lib/python3.7/site-packages/lime/lime_tabular.py in explain_instance(self, data_row, predict_fn, labels, top_labels, num_features, num_samples, distance_metric, model_regressor)
    340         # print(data_row)
    341         # print(num_samples)
--> 342         data, inverse = self.__data_inverse(data_row, num_samples)  ## BUG???
    343         if sp.sparse.issparse(data):
    344             # Note in sparse case we don't subtract mean since data would become dense

~/Downloads/scratchpad/env/lib/python3.7/site-packages/lime/lime_tabular.py in __data_inverse(self, data_row, num_samples)
    551         if self.discretizer is not None:
    552             # print(inverse[1:].shape)
--> 553             inverse[1:] = self.discretizer.undiscretize(inverse[1:])  # BUG??
    554         inverse[0] = data_row
    555         return data, inverse

~/Downloads/scratchpad/env/lib/python3.7/site-packages/lime/discretize.py in undiscretize(self, data)
    171                 # print(ret[:, feature].astype(int).sum())
    172                 ret[:, feature] = self.get_undiscretize_values(
--> 173                     feature, ret[:, feature].astype(int)  # BUG??
    174                 )
    175         return ret

~/Downloads/scratchpad/env/lib/python3.7/site-packages/lime/discretize.py in get_undiscretize_values(self, feature, values)
    155             loc=means[min_max_unequal],
    156             scale=stds[min_max_unequal],
--> 157             random_state=self.random_state  # BUG??
    158         )
    159         return ret

~/Downloads/scratchpad/env/lib/python3.7/site-packages/scipy/stats/_distn_infrastructure.py in rvs(self, *args, **kwds)
    977         cond = logical_and(self._argcheck(*args), (scale >= 0))
    978         if not np.all(cond):
--> 979             raise ValueError("Domain error in arguments.")
    980 
    981         if np.all(scale == 0):

ValueError: Domain error in arguments.

This is because self.means, self.stds, self.mins, and self.maxs in BaseDiscretizer contain NaNs.
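A minimal reproduction of the failure mode, outside of LIME (assuming numpy and scipy are installed): plain np.mean/np.std propagate a NaN, and scipy.stats.truncnorm.rvs then rejects the NaN arguments with exactly this error.

```python
import numpy as np
from scipy.stats import truncnorm

# A continuous column containing a NaN: plain reducers propagate it.
col = np.array([1.0, 2.0, np.nan, 4.0])
mean, std = np.mean(col), np.std(col)  # both NaN

# BaseDiscretizer stores stats like these; undiscretize() later calls
# truncnorm.rvs with NaN bounds, which fails scipy's argument check:
try:
    truncnorm.rvs((0 - mean) / std, (5 - mean) / std, loc=mean, scale=std)
except ValueError as e:
    print(e)  # Domain error in arguments.
```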

Solution

Option 1

Downgrade to lime==0.1.1.34. This version seems to work correctly with missing values in the training set.

Note: I haven't verified why it works, and it may not be correct.

This may be related to #352.

Option 2

In BaseDiscretizer.__init__(), replace np.mean, np.std, np.max, and np.min with their NaN-aware equivalents np.nanmean, np.nanstd, np.nanmax, and np.nanmin.

In QuartileDiscretizer.bins(), replace np.percentile with np.nanpercentile. Do the same for DecileDiscretizer, etc., if you use them.

This option simply ignores missing values when computing training data statistics, which I think is fine.

The downside is that LIME can neither sample a missing value when training the local surrogate, nor handle a missing value in a test point when generating an explanation. If you need either, you need Option 3.
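To illustrate the substitution Option 2 proposes: the nan-prefixed NumPy functions compute the same statistics while skipping NaNs, where the plain versions would return NaN.

```python
import numpy as np

col = np.array([1.0, 2.0, np.nan, 4.0])

# Plain reducers propagate the NaN...
print(np.mean(col))  # nan
# ...while the NaN-aware equivalents skip it:
print(np.nanmean(col))                       # mean of [1, 2, 4]
print(np.nanstd(col))
print(np.nanmin(col), np.nanmax(col))
print(np.nanpercentile(col, [25, 50, 75]))   # bins for QuartileDiscretizer
```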

Option 3

After some thought, I think the correct solution is: if a feature has missing values in the training data, add "missing" as a valid feature value in its statistics in LimeTabularExplainer.
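A sketch of the idea (illustrative only, not LIME's internal schema): treat NaN as its own "missing" category when tallying a feature's values and frequencies.

```python
import numpy as np

col = np.array([1.0, 2.0, np.nan, 2.0, np.nan])

# Tally the non-NaN values, then append "missing" as an extra category
# whose frequency is the NaN rate:
valid = col[~np.isnan(col)]
values, counts = np.unique(valid, return_counts=True)
feature_values = values.tolist() + ["missing"]
feature_frequencies = (counts / len(col)).tolist() + [np.isnan(col).mean()]
print(feature_values)       # [1.0, 2.0, 'missing']
print(feature_frequencies)  # [0.2, 0.4, 0.4]
```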

This will make LimeTabularExplainer generate correct statistics, and also allow you to generate an explanation on a test point with missing values.

It seems unlikely that someone will implement this, though. As a partial workaround, you may be able to compute the training data statistics (with "missing" values) outside of LIME and pass them via LimeTabularExplainer(training_data_stats=___). But to handle a missing feature in a test point, I think you still need to modify the discretization code in explain_instance().
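A hedged sketch of computing such external stats with NaNs ignored. The helper name is hypothetical, and the key names are an assumption based on lime's validate_training_data_stats check; verify both the keys and the expected value shapes against your lime version before using this.

```python
import numpy as np

def nan_aware_stats(x_train):
    """Hypothetical helper (not part of LIME): per-feature statistics
    that skip NaNs, roughly in the shape training_data_stats expects.
    Verify key names/shapes against lime's validate_training_data_stats."""
    stats = {"means": {}, "mins": {}, "maxs": {}, "stds": {},
             "feature_values": {}, "feature_frequencies": {}}
    for i in range(x_train.shape[1]):
        # Drop NaNs for this column before computing its stats:
        valid = x_train[~np.isnan(x_train[:, i]), i]
        stats["means"][i] = float(np.mean(valid))
        stats["mins"][i] = float(np.min(valid))
        stats["maxs"][i] = float(np.max(valid))
        stats["stds"][i] = float(np.std(valid))
        vals, counts = np.unique(valid, return_counts=True)
        stats["feature_values"][i] = vals.tolist()
        stats["feature_frequencies"][i] = (counts / len(valid)).tolist()
    return stats

x = np.array([[1.0, 10.0], [2.0, np.nan], [4.0, 30.0]])
print(nan_aware_stats(x)["means"])  # NaN in column 1 is ignored
```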

marcotcr commented 3 years ago

Option 3 seems right.

It seems unlikely that someone will implement this, though.

This is correct :)