interpretml / interpret-community

Interpret-Community extends the Interpret repository with additional interpretability techniques and utility functions to handle real-world datasets and workflows.
https://interpret-community.readthedocs.io/en/latest/index.html
MIT License

Issue when initializing explainer through TabularExplainer and KernelExplainer #463

Open lucazav opened 2 years ago

lucazav commented 2 years ago

I have a trained regression model (a VotingEnsemble model obtained through training with Azure AutoML) and I'd like to generate an explainer using TabularExplainer. My dataset has a column ('CALENDAR_DATE') of type datetime64[ns], which my model handles correctly (the predict method works fine). After importing the TabularExplainer class, I tried to initialize my explainer with:

features = X_train.columns

explainer = TabularExplainer(model,
                             X_train,
                             features=features,
                             model_task='regression')

but I get the following error:

RuntimeError: cuML is required to use GPU explainers. Check https://rapids.ai/start.html for more information on how to install it. The above exception was the direct cause of the following exception: [...] ValueError: Could not find valid explainer to explain model

I get the same error message when I force:

explainer = TabularExplainer(model,
                             X_train,
                             features=features,
                             model_task='regression',
                             use_gpu=False)

Thus, I tried to explicitly initialize a KernelExplainer through:

explainer = KernelExplainer(model,
                            X_train,
                            features=features,
                            model_task='regression')

but I received the error:

float() argument must be a string or a number, not 'Timestamp'
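For context, this error typically arises when an explainer tries to cast the background data to a float matrix: a DataFrame with a datetime64 column converts to an object array of Timestamps, which cannot be cast to float. A minimal reproduction of that cast failure, independent of interpret-community (column names mirror the ones above):

```python
import numpy as np
import pandas as pd

# A background dataset with a datetime column, as in the issue
df = pd.DataFrame({
    "CALENDAR_DATE": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "x": [1.0, 2.0],
})

# Mixed dtypes collapse to an object array of Timestamps and floats;
# casting that to float64 fails on the Timestamp elements.
try:
    np.asarray(df.to_numpy(), dtype=np.float64)
except TypeError as exc:
    print("TypeError:", exc)
```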

Therefore I changed the 'CALENDAR_DATE' column type to string, with:

X_train_copy = X_train.copy()
X_train_copy['CALENDAR_DATE'] = X_train_copy['CALENDAR_DATE'].astype(str)

After this, both TabularExplainer and KernelExplainer initialize correctly (with the modified dataset X_train_copy).

Why does this happen?

imatiach-msft commented 2 years ago

@lucazav I believe the issue with the bad cuML error message appeared in interpret-community 0.18.0 and was fixed in 0.21.0:

https://github.com/interpretml/interpret-community/pull/450

See the PR description:

Based on experience of debugging with customer, when TabularExplainer fails with default use_gpu=False on GPUKernelExplainer it prints the last warning, even though it will always fail. This PR separates it out so it only runs when use_gpu flag is on. The previous logic would skip every explainer if use_gpu=True other than GPUKernelExplainer, but still for some reason run it even if use_gpu=False. By separating it out, the customer will once again see the most useful error message from the last default catch-all KernelExplainer.

With the latest version you will see the underlying error you reported: float() argument must be a string or a number, not 'Timestamp'

This seems to be due to the timestamp column. However, it seems like the explainers should be able to support this datatype, based on:

https://github.com/interpretml/interpret-community/blob/master/python/interpret_community/dataset/dataset_wrapper.py#L25

It should automatically featurize the timestamp column and explain numeric fields:

                tmp_dataset[time_col_name + '_year'] = tmp_dataset[time_col_name].map(lambda x: x.year)
                tmp_dataset[time_col_name + '_month'] = tmp_dataset[time_col_name].map(lambda x: x.month)
                tmp_dataset[time_col_name + '_day'] = tmp_dataset[time_col_name].map(lambda x: x.day)
                tmp_dataset[time_col_name + '_hour'] = tmp_dataset[time_col_name].map(lambda x: x.hour)
                tmp_dataset[time_col_name + '_minute'] = tmp_dataset[time_col_name].map(lambda x: x.minute)
                tmp_dataset[time_col_name + '_second'] = tmp_dataset[time_col_name].map(lambda x: x.second)

I think I see the problem. This featurization only exists in the mimic explainer, based on this search: https://github.com/interpretml/interpret-community/search?q=apply_timestamp_featurizer

So basically all explainers other than MimicExplainer can't handle a timestamp-typed column. You can convert the column to numeric (e.g. the float value in seconds), but for some explainers, like the LIME explainer, it won't work well: specifically, LIME won't be able to sample around the value correctly to get meaningful results.

For KernelExplainer it might work more sensibly, since it just replaces the value with the background data rather than perturbing it. The feature importance might then be correct in the sense of how important the column is, but difficult to interpret in the sense that increasing/decreasing the value will result in a specific change to the output (which you can't assume anyway with SHAP values, but which will be especially difficult to assume here), since there may be many complex cyclical/seasonal relationships for the time feature.

I think it's more useful to break the time feature into components like the above and view feature importances in terms of day/hour/month/etc to get a better understanding of how it may influence the model's output.
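A sketch of that component breakdown, done manually before constructing the explainer (featurize_timestamp is a hypothetical helper mirroring the MimicExplainer snippet above, not part of interpret-community; the model would also need to be wrapped to accept the featurized columns):

```python
import pandas as pd

def featurize_timestamp(df, time_col):
    """Replace a datetime column with numeric components
    (year/month/day/hour/minute/second), so that explainers
    which require an all-numeric matrix can handle the data."""
    out = df.copy()
    ts = out[time_col]
    out[time_col + "_year"] = ts.dt.year
    out[time_col + "_month"] = ts.dt.month
    out[time_col + "_day"] = ts.dt.day
    out[time_col + "_hour"] = ts.dt.hour
    out[time_col + "_minute"] = ts.dt.minute
    out[time_col + "_second"] = ts.dt.second
    return out.drop(columns=[time_col])

# Usage sketch (X_train and CALENDAR_DATE as in the issue):
# X_train_feat = featurize_timestamp(X_train, "CALENDAR_DATE")
# explainer = KernelExplainer(wrapped_model, X_train_feat,
#                             features=X_train_feat.columns)
```

The resulting importances are then reported per component (day/hour/month/etc.), which is usually easier to reason about than a single opaque timestamp importance.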