marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier
BSD 2-Clause "Simplified" License
11.6k stars 1.81k forks

LIME explain instance resulting in empty graph #243

Closed paulaceccon closed 4 years ago

paulaceccon commented 6 years ago

Hi. I'm working with a Spark DataFrame, and to be able to use LIME we had to make some modifications:

def new_predict_fn(data):
    sdf = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), data)
    sdf = spark.createDataFrame(sdf, schema=["id", "features"]).select("features")
    predictions = cv_model.transform(sdf).select("prediction")
    return predictions.toPandas()["prediction"].values.reshape(-1)

lime_df_test = nsdf.select('features').toPandas()
lime_df_test = pd.DataFrame.from_records(lime_df_test['features'].tolist())

exp = explainer.explain_instance(lime_df_test.iloc[10].values, new_predict_fn, num_features=20)
display(exp.as_pyplot_figure())

However, it results in an "empty" explanation:

[('3575 <= 2199.13', 0.0), ('3981 <= 2189.88', 0.0), ('3987 <= 2189.88', 0.0), ('4527 <= 93.00', 0.0), ('4003 <= 1.00', 0.0), ('4528 <= 0.00', 0.0), ('3824 <= 14000000.00', 0.0), ('4256 <= 2199.73', 0.0), ('3685 <= 2190.45', 0.0), ('3579 <= 2199.13', 0.0)]

We are looking for a reason for this to happen. A simple test with the modifications mentioned above worked well, but using real data (with more than 3000 columns) we ran into this problem. The only idea that comes to mind is that LIME is not able to explain an instance locally (?), but I'm not sure that makes sense. I'm also wondering (now) whether it's just a case of the weights being plotted with 1 decimal place of precision and, if so, how I could change that.

Thanks.
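(For what it's worth, a quick way to check whether the plotted weights are truly zero or just rounded in the figure is to print them at full precision; a small sketch using the exp object above:)

# Print the raw weights instead of relying on the pyplot figure's formatting.
for feature, weight in exp.as_list():
    print(f"{feature}: {weight:.10f}")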

echo66 commented 6 years ago

Did you use any feature value bigger than the observed maximum or smaller than the minimum of the training data? Did you sample around the query, or was it the original sampling procedure (you can control that with a parameter)? If you did sample around the query, how big was your kernel width?
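(For reference, both of those knobs are constructor arguments of the tabular explainer; a minimal sketch, reusing lime_df_train from the snippet above; the kernel_width value is just an example:)

import lime.lime_tabular

# Sample perturbations around the instance being explained (instead of around
# the training data) and control the exponential kernel width.
explainer = lime.lime_tabular.LimeTabularExplainer(
    lime_df_train.values,
    mode='regression',
    sample_around_instance=True,
    kernel_width=3)  # arbitrary example value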

marcotcr commented 6 years ago

Please confirm that:

paulaceccon commented 6 years ago

Hi @marcotcr. These are the results:

i = 10
print(lime_df_test.iloc[i].values.shape)
print(type(lime_df_train.values)) # converted from spark to pandas
print(lime_df_train.shape)

(4552,)
<class 'numpy.ndarray'>
(34394, 4552)

The check that new_predict_fn(lime_df_test.iloc[10].values.reshape(-1, 1)) has shape (1, 2) fails, but my model is a regressor, so I don't see how it could have shape (1, 2).

More complete code, though not reproducible:

lime_df_train = sdf.where(col("id").isin(list(train_ids_list))).select('features').toPandas()
lime_df_train = pd.DataFrame.from_records(lime_df_train['features'].tolist())

def new_predict_fn(data):
  sdf = map(lambda x: (int(x[0]), Vectors.dense(x[0:])), data)
  sdf = spark.createDataFrame(sdf, schema=["id", "features"]).select("features")
  predictions = cv_model.transform(sdf).select("prediction")
  return predictions.toPandas()["prediction"].values.reshape(-1)

explainer = lime.lime_tabular.LimeTabularExplainer(lime_df_train.values, "regression")

lime_df_test = sdf.where(col("CoilId").isin(list(test_ids_list))).select('features').toPandas()
lime_df_test = pd.DataFrame.from_records(lime_df_test['features'].tolist())

i = 10
exp = explainer.explain_instance(lime_df_test.iloc[i].values, new_predict_fn, num_features=10)
exp.as_list()

edwardcqian commented 5 years ago

I had a similar issue; it seems that LIME has trouble with very sparse data.

The data I used initially was fairly sparse; some features had very few non-zero rows (even just 1 or 2). After removing some of the sparse features, LIME worked as intended.
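(A rough sketch of that workaround; the helper name and the threshold of 10 non-zero rows are mine, not from the thread:)

import pandas as pd

def drop_rare_columns(df: pd.DataFrame, min_nonzero: int = 10) -> pd.DataFrame:
    # Keep only columns that are non-zero in at least `min_nonzero` rows.
    nonzero_counts = (df != 0).sum(axis=0)
    return df.loc[:, nonzero_counts >= min_nonzero]

lime_df_train_dense = drop_rare_columns(lime_df_train)  # lime_df_train as defined earlier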

platinum736 commented 5 years ago

I was getting a similar problem: all my output variables had 0 weight. I decreased num_features to 6 and then 5, and after that I started getting non-zero weights for each criterion. My model has 1800 sparse features.

marcotcr commented 5 years ago

Sorry for the long delay in responding.

"The check that new_predict_fn(lime_df_test.iloc[10].values.reshape(-1, 1)) has shape (1, 2) fails, but my model is a regressor, so I don't see how it could have shape (1, 2)."

For regression, it should be (1, ). If it is (1,), my guess is that the problem is with the distance function, since you have so many sparse features. Could you try the following two things (separately, then together):

  1. set distance_metric='cosine' in explain_instance
  2. set kernel_width to different values (try 1, 5, 10, 20, 30)

If neither of these work, can you please share one row of the dataset with the predicted value so I can see if there is a bug on our end?
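A minimal sketch of both suggestions, reusing the names from the snippets above (distance_metric is an argument of explain_instance, while kernel_width is set on the explainer):

import lime.lime_tabular

# 1. cosine distance when weighting the perturbed samples
exp = explainer.explain_instance(
    lime_df_test.iloc[10].values, new_predict_fn,
    num_features=20, distance_metric='cosine')

# 2. sweep kernel_width (a constructor argument of LimeTabularExplainer)
for width in (1, 5, 10, 20, 30):
    explainer = lime.lime_tabular.LimeTabularExplainer(
        lime_df_train.values, mode='regression', kernel_width=width)
    exp = explainer.explain_instance(
        lime_df_test.iloc[10].values, new_predict_fn, num_features=20)
    print(width, exp.as_list())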

gautambajaj commented 5 years ago

I am facing the exact same issue as the original poster. Most or all of explain_instance's output values for my tabular data are empty and are shown as 0.0

Some observations:

I have tried different combinations of the suggestions from the comments above, including changing kernel_width and distance_metric, but nothing seems to work consistently. The output explanations always contain a bunch of 0.0 values. I'm not sure whether that is expected behaviour?

Any help would be appreciated.

drubiovallejo commented 5 years ago

This was happening to me as well with a large dataset with 50,000+ very sparse feature values. In my case, it turned out that the dataset was very skewed. Once I controlled for that, I got values for the weights.

keithzeng commented 5 years ago

I am facing the same problem. Is there a way to have num_features refer to the top 10 features instead of the features with indices 0 to num_features - 1?

marcotcr commented 5 years ago

@gautambajaj it is not expected behavior. Would you mind checking whether the non-zero values you get when you lower the number of features are still very near 0? Is it possible to share any data at all? @keithzeng num_features does refer to the top K features.
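(A small sketch of that check; explainer, row and predict_fn stand in for whichever explainer, instance and prediction function are being used:)

# Compare explanations at several num_features settings and see whether the
# "non-zero" weights are actually still very close to zero.
for k in (3, 5, 10):
    exp = explainer.explain_instance(row, predict_fn, num_features=k)
    print(k, [(feature, round(weight, 6)) for feature, weight in exp.as_list()])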

hanzigs commented 5 years ago

Hi, I am having a similar issue getting zeros in binary classification. Can I have some help please? Thanks.

X_train.shape
Out[180]: (152491, 165)
X_test[180].shape
Out[181]: (165,)
rf_explainer = lime.lime_tabular.LimeTabularExplainer(X_train, mode='classification',training_labels=data['class'],feature_names=feature_names)
exp = rf_explainer.explain_instance(X_test[180], rf_fit.predict_proba, num_features=10)
exp.as_list()
Out[183]: 
[('Contact_Y <= 1.00', 0.0),
 ('Business_Name_Y <= 1.00', 0.0),
 ('Y1SA_Y <= 1.00', 0.0),
 ('Valuation_Acceptable_Valuation_Acceptable_miss <= 1.00', 0.0),
 ('Trading_State_WA <= 1.00', 0.0),
 ('Employer_Name_Y <= 1.00', 0.0),
 ('Trading_Unit_Number_Y <= 0.00', 0.0),
 ('Home_State_WA <= 1.00', 0.0),
 ('Applicant_Type_Applicant_Type_miss <= 1.00', 0.0),
 ('Phone_Number_N <= 0.00', 0.0)]

[screenshot: LIME]

hanzigs commented 5 years ago

As per @platinum736's suggestion, I tried reducing num_features to 3 and I got values, which is good, thanks for that. But when we do one-hot encoding we get sparse data; is there a way to overcome this?

[screenshot: LIME2]
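(One option LIME itself supports for one-hot encoded data is to keep the categorical columns as integer codes and declare them to the explainer instead of one-hot encoding them; a sketch, where the label-encoded matrix and the index/name lists are placeholders:)

# Keep categoricals as integer codes and tell the explainer which columns they are.
explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train_label_encoded,                 # placeholder: integer-coded categoricals, not one-hot
    mode='classification',
    feature_names=feature_names,
    categorical_features=categorical_idx,  # placeholder: list of categorical column indices
    categorical_names=categorical_names)   # optional: {column index: list of level names}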

Also, the explanation features change between runs; below is the next run. [screenshot: LIME3]

As per the discussion in https://github.com/marcotcr/lime/issues/113, I used the function as follows:

# My function
import numpy

def explain(explainer, instance, predict_fn):
    # Seed the global RNG, as suggested in #113, so repeated calls are comparable
    numpy.random.seed(42)
    exp_data = explainer.explain_instance(instance, predict_fn, num_features=3)
    exp_data.as_pyplot_figure()
    return exp_data.as_list()

# My call
explain(rf_explainer, X_test[102], rf_fit.predict_proba)

But I am still getting different features; may I know how to use this correctly please? Thanks. [screenshot: LIME4]

marcotcr commented 5 years ago

Can you also show the output of explain when num_features=10, i.e. the intercept, prediction_local, and right values?

Also, are you able to replicate this with a public dataset, or share your dataset?

hanzigs commented 5 years ago

Thanks for the reply. Unfortunately I can't share the dataset, sorry about that. Also, if I go for 10 features I get zeros; it works up to 6 features. [screenshot: LIME5]

hanzigs commented 5 years ago

I have two concerns. I understand that "LIME explanations are the result of a random sampling process", but for a specific application like fraud detection, if we get a different explanation for a particular transaction every time, it's difficult for the user to identify which feature is contributing to the fraud. The other concern is the sparse-data issue, where the explanation values become zero as the number of features increases. Can I have some suggestions please? Thanks.

hanzigs commented 5 years ago

Hi marco, I have another question as well. For a binary class, I am interested only in the explanation for the positive side (say class 1, Fraud). Given the zero weights for num_features greater than 6 in my case, can I get an explanation for 6 or more features on the positive side only, so that I have enough features to explain the class I am interested in? Thanks.

marcotcr commented 5 years ago

There is definitely something wrong (likely a bug) happening with num_features=10, but since I can't replicate it on my datasets I am not able to figure out what it is.

As to the difference in explanations, in your case it was the same features with minor weight differences. When you have two features that are highly correlated, one or the other can be picked arbitrarily (i.e. either works, so LIME will pick an arbitrary one).
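(If run-to-run repeatability matters, one way to pin the sampling is the explainer's random_state argument rather than seeding numpy inside a wrapper; this is a sketch of a suggestion not prescribed in the thread, reusing the names from the snippets above:)

# Fix the explainer's random state so repeated explain_instance calls
# draw the same perturbed neighbourhood.
rf_explainer = lime.lime_tabular.LimeTabularExplainer(
    X_train, mode='classification',
    training_labels=data['class'],
    feature_names=feature_names,
    random_state=42)
exp = rf_explainer.explain_instance(X_test[102], rf_fit.predict_proba, num_features=3)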

hanzigs commented 5 years ago

Hi @marcotcr, can I have some help on this please? Just by printing exp.as_list(), how do I interpret which class the positive and negative values belong to? I'm asking because, after deploying the code to a production API, retraining happens every day and the code has to determine the classes from exp.as_list().

exp = rf_explainer.explain_instance(test_data, rf_model.predict_proba, num_features=10)
exp.as_list()
Out[75]: 
[('Type1', -0.04246093200969212),
 ('Marital_Status', -0.03338681125783632),
 ('Type2', -0.03305282227235701),
 ('Resident', -0.027560775651420212),
 ('Type', -0.02750820474164029),
 ('Term', -0.017987123190196155),
 ('Amount', -0.01617129143862048),
 ('State', -0.014215225059126329),
 ('Employer', -0.013426744603245422),
 ('Age', -0.011125272518051208)]

exp.class_names
Out[76]: ['Good', 'Bad']

exp.predict_proba
Out[77]: array([9.99998233e-01, 1.76737855e-06])

marcotcr commented 5 years ago

exp.as_list() returns explanations for label 1 (that is the default parameter), i.e. positive is positive towards 1 and negative is negative towards 1. You can use exp.as_list(label=2) or any other value if you care about a particular label. 1 is a good default for binary classification, top_labels=1 is a good default otherwise.
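(A short sketch of the label handling described above, using the objects from the snippet a few comments up; given exp.class_names == ['Good', 'Bad'], label 1 is 'Bad':)

# Ask for the explanation of a specific label up front, then read it back.
exp = rf_explainer.explain_instance(
    test_data, rf_model.predict_proba,
    num_features=10, labels=(1,))        # label 1 corresponds to 'Bad' here
print(exp.as_list(label=1))

# Or explain only the top predicted class.
exp = rf_explainer.explain_instance(
    test_data, rf_model.predict_proba,
    num_features=10, top_labels=1)
print(exp.as_list(label=exp.available_labels()[0]))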

marcotcr commented 4 years ago

I think I fixed the bug in 305b55b

Domemakarov2019 commented 4 years ago

@marcotcr Hi, I have this problem too and I don't know how to fix it. P.S. I use a Flask API.

# Imports used by the snippet
import os
import pickle

import flask
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import lime.lime_tabular

model = pickle.load(open("./model/hr.pkl", "rb"))
app = flask.Flask(__name__, template_folder='templates')

@app.route('/', methods=['GET', 'POST'])
def main():
    if flask.request.method == 'GET':
        # Just render the initial form, to get input
        return (flask.render_template('main.html'))

    if flask.request.method == "POST":
        # Extract the input
        TotalWorkingYears = flask.request.form['TotalWorkingYears']
        OverTime_code = flask.request.form['OverTime_code']
        JobInvolvement = flask.request.form['JobInvolvement']
        JobRole_code = flask.request.form['JobRole_code']
        Age = flask.request.form['Age']
        WorkLifeBalance = flask.request.form['WorkLifeBalance']
        Gender_code = flask.request.form['Gender_code']
        DistanceFromHome = flask.request.form['DistanceFromHome']
        MaritalStatus_code = flask.request.form['MaritalStatus_code']
        YearsSinceLastPromotion = flask.request.form['YearsSinceLastPromotion']
        Education = flask.request.form['Education']
        PercentSalaryHike = flask.request.form['PercentSalaryHike']
        TrainingTimesLastYear = flask.request.form['TrainingTimesLastYear']
        JobLevel = flask.request.form['JobLevel']
        YearsAtCompany = flask.request.form['YearsAtCompany']
        DailyRate = flask.request.form['DailyRate']
        YearsWithCurrManager = flask.request.form['YearsWithCurrManager']
        MonthlyIncome = flask.request.form['MonthlyIncome']
        JobSatisfaction = flask.request.form['JobSatisfaction']
        EducationField_code = flask.request.form['EducationField_code']
        RelationshipSatisfaction = flask.request.form['RelationshipSatisfaction']
        MonthlyRate = flask.request.form['MonthlyRate']
        BusinessTravel_code = flask.request.form['BusinessTravel_code']

        # Make DataFrame for model
        input_variables = pd.DataFrame([[TotalWorkingYears, OverTime_code, JobInvolvement, JobRole_code, Age, WorkLifeBalance,
                                         Gender_code, DistanceFromHome, MaritalStatus_code, YearsSinceLastPromotion,
                                         Education, PercentSalaryHike, TrainingTimesLastYear, JobLevel, YearsAtCompany, DailyRate,
                                         YearsWithCurrManager, MonthlyIncome, JobSatisfaction, EducationField_code,
                                         RelationshipSatisfaction, MonthlyRate, BusinessTravel_code]],
                                       columns=['TotalWorkingYears', 'OverTime_code', 'JobInvolvement', 'JobRole_code',
                                                'Age', 'WorkLifeBalance', 'Gender_code', 'DistanceFromHome', 'MaritalStatus_code',
                                                'YearsSinceLastPromotion', 'Education', 'PercentSalaryHike', 'TrainingTimesLastYear', 'JobLevel',
                                                'YearsAtCompany', 'DailyRate', 'YearsWithCurrManager', 'MonthlyIncome', 'JobSatisfaction',
                                                'EducationField_code', 'RelationshipSatisfaction', 'MonthlyRate', 'BusinessTravel_code'],
                                       dtype=float,
                                       index=['input'])

        # Get the model's prediction
        prediction = model.predict(input_variables)[0]
        prediction_percentage = model.predict_proba(input_variables)[:, 1]

        row_to_show = 1
        data_for_prediction = input_variables.iloc[1]  # use 1 row of data here. Could use multiple rows if desired
        data_for_prediction_array = data_for_prediction.values.reshape(1, -1)

        model.predict_proba(data_for_prediction_array)
        X_featurenames = input_variables.columns

        categorical_features = np.argwhere(np.array([len(set(input_variables.values[0]))]))

        # uf = BytesIO()

        predict_fn = lambda x: model.predict_proba(x).astype(float)

        explainer = lime.lime_tabular.LimeTabularExplainer(input_variables.values,
                                                           feature_names=X_featurenames,
                                                           class_names=['Yes', 'No'],
                                                           categorical_features=categorical_features,
                                                           verbose=True, mode='classification')

        exp = explainer.explain_instance(input_variables.values[0], predict_fn, num_features=5)
        fig = exp.as_pyplot_figure()

        # plot_url = base64.b64encode(uf.getbuffer(exp)).decode("ascii")

        # Create object that can calculate shap values
        # explainer = shap.TreeExplainer(model)

        # img = StringIO()

        # Calculate Shap values
        # shap_values = explainer.shap_values(data_for_prediction_array)
        # shap.initjs()
        # shap.summary_plot(explainer.expected_value, shap_values, data_for_prediction, show=False)
        # shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction, matplotlib=True, show=False)
        if os.path.isfile("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg"):
            os.remove("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg")
            plt.savefig("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph",
                        format="svg",
                        dpi=150,
                        bbox_inches='tight')
            # plt.savefig('/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg')
        else:
            # plt.savefig('/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph.svg')
            plt.savefig("/home/Domemakarov2013/smart_hr/static/images/shap_graph/graph",
                        format="svg",
                        dpi=150,
                        bbox_inches='tight')

The code doesn't throw an error and it saves the image, but the graph is empty. What's wrong with it?