Did you use any feature value bigger than the observed maximum or smaller than the minimum of the training data? Did you use sampling around the query, or was it the original sampling procedure (you can control that with a parameter)? If you did sample around the query, how big was your kernel width?
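For reference, a minimal sketch of where those knobs live (kernel_width and sample_around_instance are constructor arguments of LimeTabularExplainer; X_train is just a placeholder name here):

```python
import lime.lime_tabular

# Sketch only: kernel_width controls the width of the exponential kernel used to
# weight perturbed samples; sample_around_instance=True perturbs around the
# queried instance instead of around the training-data feature means.
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,                      # placeholder: training data as a numpy array
    mode='regression',
    kernel_width=None,            # None -> sqrt(n_features) * 0.75 by default
    sample_around_instance=True,
)
```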
Please confirm that:
Hi @marcotcr. These are the results:
```python
i = 10
print(lime_df_test.iloc[i].values.shape)
print(type(lime_df_train.values))  # converted from Spark to pandas
print(lime_df_train.shape)
```
which prints:
```
(4552,)
<class 'numpy.ndarray'>
(34394, 4552)
```
The check that new_predict_fn(lime_df_test.iloc[10].values.reshape(-1, 1)) has shape (1, 2) fails, but my model is a regressor, so I don't see how it could have shape (1, 2) anyway.
Here is more complete code, though not reproducible:
```python
lime_df_train = sdf.where(col("id").isin(list(train_ids_list))).select('features').toPandas()
lime_df_train = pd.DataFrame.from_records(lime_df_train['features'].tolist())

def new_predict_fn(data):
    sdf = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), data)
    sdf = spark.createDataFrame(sdf, schema=["id", "features"]).select("features")
    predictions = cv_model.transform(sdf).select("prediction")
    return predictions.toPandas()["prediction"].values.reshape(-1)

explainer = lime.lime_tabular.LimeTabularExplainer(lime_df_train.values, "regression")

lime_df_test = sdf.where(col("CoilId").isin(list(test_ids_list))).select('features').toPandas()
lime_df_test = pd.DataFrame.from_records(lime_df_test['features'].tolist())

i = 10
exp = explainer.explain_instance(lime_df_test.iloc[i].values, new_predict_fn, num_features=10)
exp.as_list()
```
I had a similar issue, and it seems that LIME has problems with very sparse data.
The data I used was fairly sparse; some features had very few non-zero rows (even just 1 or 2). After removing some of the sparse features, LIME worked as intended.
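For anyone wanting to do the same, a minimal sketch of dropping nearly-constant columns before building the explainer (using scikit-learn's VarianceThreshold; the threshold and array names are placeholders):

```python
from sklearn.feature_selection import VarianceThreshold

# Sketch: remove features that are (almost) always the same value, e.g. columns
# with only 1 or 2 non-zero rows, before fitting the LIME explainer.
selector = VarianceThreshold(threshold=1e-4)        # placeholder threshold
X_train_reduced = selector.fit_transform(X_train)
X_test_reduced = selector.transform(X_test)
kept_columns = selector.get_support(indices=True)   # indices of retained features
```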
I was getting a similar problem: all my output variables had 0 weight. I decreased num_features to 6 and then 5, and after that I started getting non-zero weights for each criterion. My model has 1800 sparse features.
Sorry for the long delay in responding.
> The check that new_predict_fn(lime_df_test.iloc[10].values.reshape(-1, 1)) has shape (1, 2) fails, but my model is a regressor, so I don't see how it could have shape (1, 2) anyway.
For regression, it should be (1,). If it is (1,), my guess is that the problem is with the distance function, since so many of your features are sparse. Could you try the following two things (separately, then together): changing the kernel width when constructing the explainer, and distance_metric='cosine' in explain_instance? If neither of these works, can you please share one row of the dataset with the predicted value, so I can see if there is a bug on our end?
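In case it helps, a small sketch of those checks using the variable names from the snippet above (the row index is arbitrary):

```python
# For mode="regression" the prediction function should return a 1-D array of
# shape (n_samples,); for a single row that is (1,).
row = lime_df_test.iloc[10].values.reshape(1, -1)   # one sample, all features
print(new_predict_fn(row).shape)                    # expected: (1,)

# Cosine distance can behave better than the default Euclidean distance on
# very sparse, high-dimensional data.
exp = explainer.explain_instance(
    lime_df_test.iloc[10].values,
    new_predict_fn,
    num_features=10,
    distance_metric='cosine',
)
```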
I am facing the exact same issue as the original poster. Most or all of explain_instance's output values for my tabular data are empty and are shown as 0.0
Some observations:
I have tried different combinations of suggestions from the comments above including changing kernel_width and distance_metric, but nothing seems to work consistently. The output explanations always have a bunch of 0.0 values. I'm not sure if that is expected behaviour?
Any help would be appreciated.
This was happening to me as well with a large dataset with 50,000+ very sparse feature values. In my case, it turned out that the dataset was very skewed. Once I controlled for that, I got values for the weights.
I am facing the same problem. Is there a way to have num_features refer to the top 10 features instead of the features indexed from 0 to num_features - 1?
@gautambajaj it is not expected behavior. Would you mind checking if the non-zero values you get when you lower the number of features are still very near 0? Is it possible to share any data at all?
@keithzeng num_features does refer to the top K features.
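A tiny sketch of what that means in practice (variable names are placeholders):

```python
# With num_features=10, the explanation keeps the 10 features judged most
# important for this instance by the local linear model, not features 0..9.
exp = explainer.explain_instance(row, predict_fn, num_features=10)
for feature, weight in exp.as_list():
    print(feature, weight)   # typically ordered by decreasing absolute weight
```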
Hi, I am having a similar issue getting zeros in binary classification. Can I have some help please? Thanks.
```python
X_train.shape
# Out[180]: (152491, 165)

X_test[180].shape
# Out[181]: (165,)

rf_explainer = lime.lime_tabular.LimeTabularExplainer(X_train, mode='classification',
                                                      training_labels=data['class'],
                                                      feature_names=feature_names)
exp = rf_explainer.explain_instance(X_test[180], rf_fit.predict_proba, num_features=10)
exp.as_list()
# Out[183]:
# [('Contact_Y <= 1.00', 0.0),
#  ('Business_Name_Y <= 1.00', 0.0),
#  ('Y1SA_Y <= 1.00', 0.0),
#  ('Valuation_Acceptable_Valuation_Acceptable_miss <= 1.00', 0.0),
#  ('Trading_State_WA <= 1.00', 0.0),
#  ('Employer_Name_Y <= 1.00', 0.0),
#  ('Trading_Unit_Number_Y <= 0.00', 0.0),
#  ('Home_State_WA <= 1.00', 0.0),
#  ('Applicant_Type_Applicant_Type_miss <= 1.00', 0.0),
#  ('Phone_Number_N <= 0.00', 0.0)]
```
As per @platinum736's suggestion, I tried reducing num_features to 3 and I got values, which is good, thanks for that. But when we do one-hot encoding we get sparse data; is there a way to overcome this?
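(One option worth noting: LimeTabularExplainer can take integer-encoded categorical columns directly via categorical_features / categorical_names, which avoids blowing the data up into a sparse one-hot matrix. A sketch with hypothetical column indices and mappings:)

```python
# Sketch: keep categorical columns as integer codes instead of one-hot columns.
categorical_features = [0, 3, 7]                 # hypothetical indices of categorical columns
categorical_names = {0: ['N', 'Y'],              # hypothetical code -> label mappings
                     3: ['Single', 'Married'],
                     7: ['NSW', 'WA']}

explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    mode='classification',
    feature_names=feature_names,
    categorical_features=categorical_features,
    categorical_names=categorical_names,
)
```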
Also, the explanation features change between runs; below is the next run.
As per the discussion in https://github.com/marcotcr/lime/issues/113, I used the function as follows:
```python
# My function
def explain(explainer, instance, predict_fn):
    numpy.random.seed(42)
    exp_data = explainer.explain_instance(instance, predict_fn, num_features=3)
    exp_data.as_pyplot_figure()
    return exp_data.as_list()

# My call
explain(rf_explainer, X_test[102], rf_fit.predict_proba)
```
But I am still getting different features. May I know how to use that correctly please? Thanks.
Can you also show the output of explain when num_features=10, i.e. the intercept, prediction_local and right? Also, are you able to replicate this with a public dataset, or share your dataset?
Thanks for the reply. Unfortunately I can't share the dataset, sorry about that. Also, if I go for 10 features I get zeros; it works up to 6 features.
I have two concerns. I understand that "LIME explanations are the result of a random sampling process", but for a specific application like fraud detection, if we get a different explanation every time for a particular transaction, it is difficult for the user to identify which feature is contributing to the fraud. The other is the sparse-data issue, where increasing num_features gives explanation values of zero. Can I have some suggestions please? Thanks.
Hi Marco, I have another question as well. For binary classification I am interested only in the explanation on the positive side (say class 1, Fraud). In continuation of the zero weights for num_features greater than 6 in my case, can I get explanations for 6 or more features on the positive side only, so that I have enough features to explain towards the class I am interested in? Thanks.
There is definitely something wrong (likely a bug) happening with num_features=10, but since I can't replicate it on my datasets I am not able to figure out what it is.
As to the difference in explanations, in your case it was the same features with minor weight differences. When you have two features that are highly correlated, one or the other can be picked arbitrarily (i.e. either works, so LIME will pick an arbitrary one).
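On the run-to-run variability: if you need deterministic explanations, one option is pinning the sampling via the explainer's random_state constructor argument (a sketch, reusing the names from the earlier snippet):

```python
# Sketch: a fixed random_state makes the perturbation sampling reproducible, so
# re-running the same script produces the same explanations each time.
rf_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train,
    mode='classification',
    training_labels=data['class'],
    feature_names=feature_names,
    random_state=42,
)
exp = rf_explainer.explain_instance(X_test[180], rf_fit.predict_proba, num_features=10)
```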
Hi @marcotcr, can I get some help on this please? By just printing exp.as_list(), how do I interpret which class the positive and negative values belong to? I'm asking because after deploying the code in a production API, retraining happens every day, so the code has to work out the classes from exp.as_list().
```python
exp = rf_explainer.explain_instance(test_data, rf_model.predict_proba, num_features=10)
exp.as_list()
# Out[75]:
# [('Type1', -0.04246093200969212),
#  ('Marital_Status', -0.03338681125783632),
#  ('Type2', -0.03305282227235701),
#  ('Resident', -0.027560775651420212),
#  ('Type', -0.02750820474164029),
#  ('Term', -0.017987123190196155),
#  ('Amount', -0.01617129143862048),
#  ('State', -0.014215225059126329),
#  ('Employer', -0.013426744603245422),
#  ('Age', -0.011125272518051208)]

exp.class_names
# Out[76]: ['Good', 'Bad']

exp.predict_proba
# Out[77]: array([9.99998233e-01, 1.76737855e-06])
```
exp.as_list() returns explanations for label 1 (that is the default parameter), i.e. positive is positive towards 1 and negative is negative towards 1. You can use exp.as_list(label=2) or any other value if you care about a particular label. 1 is a good default for binary classification; top_labels=1 is a good default otherwise.
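If the production code needs to recover the class names programmatically, a small sketch using the attributes shown in the output above (exp.class_names and exp.as_list):

```python
label = 1                              # the label the explanation refers to (the default)
class_name = exp.class_names[label]    # e.g. 'Bad' for the output above

for feature, weight in exp.as_list(label=label):
    direction = 'towards' if weight > 0 else 'away from'
    print(f"{feature}: {weight:+.4f} ({direction} class '{class_name}')")
```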
I think I fixed the bug in 305b55b
@marcotcr Hi, I have this problem too and I don't know how to fix it. PS: I call it from a Flask API.
```python
model = pickle.load(open("./model/hr.pkl", "rb"))
app = flask.Flask(__name__, template_folder='templates')

@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        return (flask.render_template('main.html'))

    if flask.request.method == "POST":
        # Extract the input
        TotalWorkingYears = flask.request.form['TotalWorkingYears']
        OverTime_code = flask.request.form['OverTime_code']
        JobInvolvement = flask.request.form['JobInvolvement']
        JobRole_code = flask.request.form['JobRole_code']
        Age = flask.request.form['Age']
        WorkLifeBalance = flask.request.form['WorkLifeBalance']
        Gender_code = flask.request.form['Gender_code']
        DistanceFromHome = flask.request.form['DistanceFromHome']
        MaritalStatus_code = flask.request.form['MaritalStatus_code']
        YearsSinceLastPromotion = flask.request.form['YearsSinceLastPromotion']
        Education = flask.request.form['Education']
        PercentSalaryHike = flask.request.form['PercentSalaryHike']
        TrainingTimesLastYear = flask.request.form['TrainingTimesLastYear']
        JobLevel = flask.request.form['JobLevel']
        YearsAtCompany = flask.request.form['YearsAtCompany']
        DailyRate = flask.request.form['DailyRate']
        YearsWithCurrManager = flask.request.form['YearsWithCurrManager']
        MonthlyIncome = flask.request.form['MonthlyIncome']
        JobSatisfaction = flask.request.form['JobSatisfaction']
        EducationField_code = flask.request.form['EducationField_code']
        RelationshipSatisfaction = flask.request.form['RelationshipSatisfaction']
        MonthlyRate = flask.request.form['MonthlyRate']
        BusinessTravel_code = flask.request.form['BusinessTravel_code']

        # Make DataFrame for model
        input_variables = pd.DataFrame([[TotalWorkingYears, OverTime_code, JobInvolvement, JobRole_code, Age, WorkLifeBalance,
                                         Gender_code, DistanceFromHome, MaritalStatus_code, YearsSinceLastPromotion,
                                         Education, PercentSalaryHike, TrainingTimesLastYear, JobLevel, YearsAtCompany, DailyRate,
                                         YearsWithCurrManager, MonthlyIncome, JobSatisfaction, EducationField_code,
                                         RelationshipSatisfaction, MonthlyRate, BusinessTravel_code]],
                                       columns=['TotalWorkingYears', 'OverTime_code', 'JobInvolvement', 'JobRole_code',
                                                'Age', 'WorkLifeBalance', 'Gender_code', 'DistanceFromHome', 'MaritalStatus_code',
                                                'YearsSinceLastPromotion', 'Education', 'PercentSalaryHike', 'TrainingTimesLastYear', 'JobLevel',
                                                'YearsAtCompany', 'DailyRate', 'YearsWithCurrManager', 'MonthlyIncome', 'JobSatisfaction',
                                                'EducationField_code', 'RelationshipSatisfaction', 'MonthlyRate', 'BusinessTravel_code'],
                                       dtype=float,
                                       index=['input'])

        # Get the model's prediction
        prediction = model.predict(input_variables)[0]
        prediction_percentage = model.predict_proba(input_variables)[:, 1]

        row_to_show = 1
        data_for_prediction = input_variables.iloc[1]  # use 1 row of data here. Could use multiple rows if desired
        data_for_prediction_array = data_for_prediction.values.reshape(1, -1)
        model.predict_proba(data_for_prediction_array)

        X_featurenames = input_variables.columns
        categorical_features = np.argwhere(np.array([len(set(input_variables.values[0]))]))

        # uf = BytesIO()
        predict_fn = lambda x: model.predict_proba(x).astype(float)
        explainer = lime.lime_tabular.LimeTabularExplainer(input_variables.values,
                                                           feature_names=X_featurenames,
                                                           class_names=['Yes', 'No'],
                                                           categorical_features=categorical_features,
                                                           verbose=True, mode='classification')
        exp = explainer.explain_instance(input_variables.values[0], predict_fn, num_features=5)
        fig = exp.as_pyplot_figure()
        # plot_url = base64.b64encode(uf.getbuffer(exp)).decode("ascii")

        # Create object that can calculate shap values
        # explainer = shap.TreeExplainer(model)
        # img = StringIO()
        # Calculate Shap values
        # shap_values = explainer.shap_values(data_for_prediction_array)
        # shap.initjs()
        # shap.summary_plot(explainer.expected_value, shap_values, data_for_prediction, show=False)
        # shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction, matplotlib=True, show=False)

        if os.path.isfile("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg"):
            os.remove("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg")
            plt.savefig("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph",
                        format="svg",
                        dpi=150,
                        bbox_inches='tight')
            # plt.savefig('/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg')
        else:
            # plt.savefig('/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg')
            plt.savefig("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph",
                        format="svg",
                        dpi=150,
                        bbox_inches='tight')
```
The code doesn't error and it saves the image, but the graph is empty. What is wrong with it?
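One thing that may be worth trying (a sketch, not a confirmed fix): save the figure object returned by as_pyplot_figure() directly, rather than whatever pyplot currently considers the active figure:

```python
# Sketch: exp.as_pyplot_figure() returns a matplotlib Figure; saving that object
# avoids accidentally saving a different (empty) current figure.
fig = exp.as_pyplot_figure()
fig.savefig("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg",
            format="svg", dpi=150, bbox_inches='tight')
```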
Hi. I'm working with a spark data frame and to be able to make use of LIME we had to make some modifications:
However, it results in an "empty" explanation:
```
[('3575 <= 2199.13', 0.0), ('3981 <= 2189.88', 0.0), ('3987 <= 2189.88', 0.0), ('4527 <= 93.00', 0.0), ('4003 <= 1.00', 0.0), ('4528 <= 0.00', 0.0), ('3824 <= 14000000.00', 0.0), ('4256 <= 2199.73', 0.0), ('3685 <= 2190.45', 0.0), ('3579 <= 2199.13', 0.0)]
```
We are looking for the reason this happens. A simple test with the modifications mentioned above worked well, but using real data (with more than 3000 columns) we hit this problem. The only idea that comes to mind is that LIME is not able to explain the instance locally (?), but I'm not sure that makes sense. I'm also wondering (now) whether it's just a case of the weights being displayed with 1 decimal place of precision and, if so, how I could change that.
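For what it's worth, a quick way to check whether this is only a display-precision issue (a sketch using the Explanation object from the call above):

```python
# Sketch: print the full float repr of each weight; the values returned by
# as_list() are not rounded for display, so anything shown as 0.0 here really
# is (numerically) zero rather than rounded away.
for feature, weight in exp.as_list():
    print(feature, repr(weight))
```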
Thanks.