loft-br / xgboost-survival-embeddings

Improving XGBoost survival analysis with embeddings and debiased estimators
https://loft-br.github.io/xgboost-survival-embeddings/
Apache License 2.0

SHAP explanation for XGBSEKaplanTree or bootstrapestimator. #58

Open hellorp1990 opened 2 years ago

hellorp1990 commented 2 years ago

Hi, is it possible to use SHAP with XGBSEKaplanTree or XGBSEBootstrapEstimator? SHAP's TreeExplainer does not work with them. The Permutation explainer starts evaluating but then fails with the error "ValueError: max_evals=1785 is too low for the Permutation explainer, it must be at least 2 * num_features + 1 = 1799!"

I am not sure how to fix this error. If anyone can point me in the right direction, that would be really helpful. Thank you in advance.
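The bound in that error message can be checked by hand. The sketch below only reproduces the arithmetic quoted in the message (2 * num_features + 1); the helper name `min_permutation_max_evals` is hypothetical and is not part of SHAP.

```python
def min_permutation_max_evals(num_features: int) -> int:
    """Lower bound quoted in the error message: 2 * num_features + 1."""
    return 2 * num_features + 1


# The message reports a minimum of 1799, which corresponds to 899 features.
num_features = 899
required = min_permutation_max_evals(num_features)
print(required)  # 1799

# max_evals=1785 is below the bound, which is what the ValueError complains about.
print(1785 >= required)  # False
```

Any `max_evals` at or above this bound should get past the check, although run time grows with the number of evaluations.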

davivieirab commented 2 years ago

Hi @hellorp1990. Could you provide a code example showing how you are trying to use XGBSE with SHAP? Are you using the whole survival curve as your target, or have you transformed the predict function to output a single-value response?
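To illustrate the second option in the question above, here is a minimal sketch of wrapping a survival-curve predict function so that it returns one value per sample. `fake_survival_predict` is a hypothetical stand-in for a fitted model's predict method (one survival probability per time bucket), and `make_single_time_predict` is an illustrative helper; neither is part of XGBSE or SHAP.

```python
import numpy as np


def fake_survival_predict(X):
    """Hypothetical stand-in: returns (n_samples, n_time_buckets) survival probabilities."""
    X = np.asarray(X, dtype=float)
    risk = 1.0 / (1.0 + np.exp(-X.sum(axis=1)))  # one risk score per sample, in (0, 1)
    times = np.array([1.0, 2.0, 3.0])            # three illustrative time buckets
    return np.exp(-np.outer(risk, times))        # survival decreases over time


def make_single_time_predict(predict_fn, bucket_index):
    """Wrap predict_fn so it returns only the survival probability of one time bucket."""
    def predict_one_bucket(X):
        curves = np.asarray(predict_fn(X))
        return curves[:, bucket_index]
    return predict_one_bucket


X = np.array([[0.2, -1.0], [1.5, 0.3]])
full_curves = fake_survival_predict(X)
last_bucket = make_single_time_predict(fake_survival_predict, bucket_index=-1)(X)

print(full_curves.shape)  # (2, 3): one curve per sample
print(last_bucket.shape)  # (2,): one value per sample
```

An explainer given the wrapped function would then produce one SHAP value per feature for each sample, rather than one per feature per time bucket.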

hellorp1990 commented 2 years ago

@davivieirab My model:

xgbse_model = XGBSEKaplanTree(params)
bootstrap_estimator = XGBSEBootstrapEstimator(xgbse_model, n_estimators=100)

SHAP model:

shap_values = shap.Explainer(bootstrap_estimator.predict, data, feature_names=feature_names, max_evals=2000)
shaps = shap_values(data)

hellorp1990 commented 2 years ago

@davivieirab If I don't pass max_evals to shap.Explainer, it won't run at all. With max_evals=2000, SHAP runs, but the projected time to finish was 10 hours.

My dataset has 330 rows and 900 columns, and I used a train-test split (25% for test).

davivieirab commented 2 years ago

@hellorp1990, XGBSEBootstrapEstimator produces a multi-output prediction: for each sample you get a whole survival function, with a survival probability for each time bucket evaluated. Consequently, for each sample you will have one array of SHAP values (one value per feature) for each time bucket.

Find a code example below - references: SHAP values for multi-output problems, using KernelSHAP with XGBoost:

import pandas as pd
import shap
from xgbse import XGBSEKaplanTree, XGBSEBootstrapEstimator

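# your_params is a placeholder for your own XGBoost parameter dict
xgbse_model = XGBSEKaplanTree(your_params)
bootstrap_estimator = XGBSEBootstrapEstimator(xgbse_model, n_estimators=100)

# the estimator must be fitted before predict is called (X_train and y_train are assumed to exist)
bootstrap_estimator.fit(X_train, y_train)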

columns = X_train.columns

## kernel shap sends data as numpy array which has no column names, so we fix it
## source: https://gist.github.com/noleto/05dfa4a691ebbc8816c035b86d2d00d4#file-shap_xgboost-py-L46
def xgbse_predict(data_asarray):
    data_asframe =  pd.DataFrame(data_asarray, columns=columns)
    return bootstrap_estimator.predict(data_asframe)

#### Kernel SHAP
shap_kernel_explainer = shap.KernelExplainer(xgbse_predict, X_train.head(100))

# Explain a single instance - output: one array of n_features SHAP values per time bucket
shap_one = shap_kernel_explainer.shap_values(X_train.iloc[0])

# Get explanations for the first time bucket
first_time_bucket_shap_values = pd.Series(shap_one[0])

# Print shap values for the first time bucket and the corresponding features
print(pd.concat([first_time_bucket_shap_values, pd.Series(columns)], axis=1))

You will get something like (for the first time bucket):

shap_value feature
0.001919 x0
0.006411 x1
0.000411 x2
0.002464 x3
0.000239 x4
0.000893 x5
0.002441 x6
0.000117 x7
0.009901 x8

davivieirab commented 2 years ago

As an action item, we will add a notebook with brief documentation on how to use SHAP with the XGBSE library.

yangwei1993 commented 1 year ago

Hello @davivieirab, have you added the documentation on how to use SHAP with XGBSE? When I run my code the way you described above, it fails with an error. Here is my code:

from xgbse import XGBSEDebiasedBCE

# fitting xgbse model
xgbse_model = XGBSEDebiasedBCE()
xgbse_model.fit(X_train, y_train, time_bins=TIME_BINS)

# predicting
y_pred = xgbse_model.predict(X_test)

import shap
from xgbse import XGBSEKaplanTree, XGBSEBootstrapEstimator

## kernel shap sends data as numpy array which has no column names, so we fix it
## source: https://gist.github.com/noleto/05dfa4a691ebbc8816c035b86d2d00d4#file-shap_xgboost-py-L46
bootstrap_estimator = XGBSEBootstrapEstimator(xgbse_model, n_estimators=100)

def xgbse_predict(data_asarray):
    data_asframe = pd.DataFrame(data_asarray, columns=columns)
    return bootstrap_estimator.predict(data_asframe)

columns = X_train.columns
shap_kernel_explainer = shap.KernelExplainer(xgbse_predict, X_train)

#### Kernel SHAP
# Explain a single instance - output: (1, n_time_buckets, n_features)
shap_one = shap_kernel_explainer.shap_values(X_train.iloc[0])

# Get explanations for the first time bucket
first_time_bucket_shap_values = pd.Series(shap_one[0])
print(pd.concat([first_time_bucket_shap_values, pd.Series(columns)], axis=1))

Error report: Provided model function fails when applied to the provided data set. 'XGBSEBootstrapEstimator' object has no attribute 'estimators_'
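The error message suggests that predict was called on an XGBSEBootstrapEstimator that had not been fitted: in the code above only xgbse_model is fitted, while bootstrap_estimator is not. The sketch below shows the general pattern with a toy class. `ToyBootstrap` is a hypothetical stand-in, not XGBSE code, and it assumes XGBSE follows the scikit-learn convention of creating trailing-underscore attributes such as `estimators_` inside `fit`.

```python
class ToyBootstrap:
    """Toy stand-in (not XGBSE code): fitted state lives in `estimators_`."""

    def __init__(self, n_estimators=3):
        self.n_estimators = n_estimators  # configuration only, nothing is trained here

    def fit(self, X, y):
        # Trailing-underscore attributes are created during fit.
        mean_y = sum(y) / len(y)
        self.estimators_ = [mean_y for _ in range(self.n_estimators)]
        return self

    def predict(self, X):
        # Raises AttributeError if fit() was never called.
        avg = sum(self.estimators_) / len(self.estimators_)
        return [avg for _ in X]


model = ToyBootstrap()
try:
    model.predict([[0.0], [1.0]])
except AttributeError as err:
    print("before fit:", err)

model.fit([[0.0], [1.0]], [2.0, 4.0])
print("after fit:", model.predict([[0.0], [1.0]]))  # after fit: [3.0, 3.0]
```

If that assumption holds, calling bootstrap_estimator.fit(X_train, y_train, time_bins=TIME_BINS) before building the KernelExplainer would be the first thing to try; whether fit forwards time_bins to the base estimator is also an assumption here.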