marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier

Confusing results on simple linear regression model. #665

Closed: kilasuelika closed this issue 2 years ago

kilasuelika commented 2 years ago

I generate a simple linear-model dataset and use explain_instance() to check each variable's influence. I expect the influence values to be consistent with the true parameters, since this model is really simple. But I find that LIME rarely gives the expected results.

My code:

import numpy as np

import sklearn.linear_model
import sklearn.model_selection

import lime
import lime.lime_tabular

# Generating data
np.random.seed(5)
X=np.random.randn(10000,3)
theta=np.array([3,1,2.5])
E=np.random.randn(10000)/100
y=np.matmul(X,theta)+E

# Fit the linear model
model = sklearn.linear_model.LinearRegression()
train, test, labels_train, labels_test = sklearn.model_selection.train_test_split(X, y, train_size=0.80)

model.fit(train, labels_train)
print(model.coef_)

# Explain one test instance
explainer = lime.lime_tabular.LimeTabularExplainer(train, verbose=True, mode='regression')

i = 21
exp = explainer.explain_instance(test[i], model.predict, num_features=3)

exp.show_in_notebook(show_table=True)

The true parameters are [3, 1, 2.5] and the estimated coefficients are [3.00008921, 0.99996862, 2.49974063]. So feature 0 should have the maximum positive influence, and all features should have positive influence. But LIME gives influences of [-2.07, -1.33, 0.46], which seem almost unrelated to the true parameters. Changing i doesn't make things better.

Did I do something wrong?

kilasuelika commented 2 years ago

Finally I understand that I should compare against theta * test[i], not theta itself. Using print(test[i] * theta), the order of [-2.07, -1.33, 0.46] is comparable.
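
For anyone hitting the same confusion, here is a minimal check, reusing the model, test, i, and exp objects from the code above. LIME's weights describe how each feature value contributes to this particular prediction locally, so they should be compared against the per-instance contributions theta_i * x_i, not against the coefficients theta_i. Note that exp.as_list() reports weights for discretized feature ranges, so only the signs and ordering are expected to match, not the exact magnitudes.

# Per-feature contributions for this instance: theta_i * x_i.
print(test[i] * model.coef_)

# LIME's local weights; their signs and ordering should roughly match
# the contributions above, not model.coef_ itself.
for feature, weight in exp.as_list():
    print(feature, weight)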