linkedin / FastTreeSHAP

Fast SHAP value computation for interpreting tree-based models
BSD 2-Clause "Simplified" License

Segmentation fault (core dumped) for shap_values #5

Closed ShaunFChen closed 2 years ago

ShaunFChen commented 2 years ago

Hi,

I'm trying to apply the TreeExplainer to compute shap_values for an XGBoost regression model on a large dataset. During hyperparameter tuning, certain hyperparameter sets fail with a segmentation fault at the explainer.shap_values() step. I used fasttreeshap=0.1.1 and xgboost=1.4.1 (also tested 1.6.0) on a machine with an Intel Xeon E5-2640 v4 (20) @ 3.400GHz CPU and 128GB of memory. The sample code below is a toy script that reproduces the issue using the Superconductor dataset from the example notebook:

# for debugging
import faulthandler
faulthandler.enable()

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import xgboost as xgb
import fasttreeshap

print(f"XGBoost version: {xgb.__version__}")
print(f"fasttreeshap version: {fasttreeshap.__version__}")

# source of data: https://archive.ics.uci.edu/ml/datasets/superconductivty+data
data = pd.read_csv("FastTreeSHAP/data/superconductor_train.csv", engine = "python")
train, test = train_test_split(data, test_size = 0.5, random_state = 0)
label_train = train["critical_temp"]
label_test = test["critical_temp"]
train = train.iloc[:, :-1]
test = test.iloc[:, :-1]

print("train XGBoost model")
xgb_model = xgb.XGBRegressor(
    max_depth = 100, n_estimators = 200, learning_rate = 0.1, n_jobs = -1, alpha = 0.12, random_state = 0)
xgb_model.fit(train, label_train)

print("run TreeExplainer()")
shap_explainer = fasttreeshap.TreeExplainer(xgb_model)

print("run shap_values()")
shap_values = shap_explainer.shap_values(train)

The /usr/bin/time report of the program execution also shows that the "Maximum resident set size" was only about 32GB:

~$ /usr/bin/time -v python segfault.py 
XGBoost version: 1.4.1
fasttreeshap version: 0.1.1
train XGBoost model
run TreeExplainer()
run shap_values()
Fatal Python error: Segmentation fault

Thread 0x00007ff2c2793740 (most recent call first):
  File "~/.local/lib/python3.8/site-packages/fasttreeshap/explainers/_tree.py", line 459 in shap_values
  File "segfault.py", line 27 in <module>
Segmentation fault (core dumped)

Command terminated by signal 11
        Command being timed: "python segfault.py"
        User time (seconds): 333.65
        System time (seconds): 27.79
        Percent of CPU this job got: 797%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:45.30
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 33753096
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 8188488
        Voluntary context switches: 3048
        Involuntary context switches: 3089
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

In some cases (including the example above), forcing TreeExplainer(algorithm="v1") did help, which suggests the issue only happens with "v2" (or with "auto" when it passes the _memory_check()). However, v1 would occasionally raise a check_additivity error instead, which remains unsolved in the original shap algorithm.

Alternatively, passing approximate=True to explainer.shap_values() works, but it raises consistency concerns for the reproducibility of our studies...
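For reference, a minimal sketch of the two workarounds above, using the same xgb_model and train from the reproduction script (only the algorithm and approximate arguments are the point here):

# Workaround 1: force the v1 algorithm instead of v2/auto.
shap_explainer_v1 = fasttreeshap.TreeExplainer(xgb_model, algorithm="v1")
shap_values_v1 = shap_explainer_v1.shap_values(train)

# Workaround 2: approximate (Saabas-style) values avoid the segfault,
# but are not exactly the SHAP values we want to reproduce in our studies.
shap_explainer = fasttreeshap.TreeExplainer(xgb_model)
shap_values_approx = shap_explainer.shap_values(train, approximate=True)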

Could you help me debug this issue?

Thank you so much!

jlyang1990 commented 2 years ago

Hi Shaun,

Thanks for pointing out this issue. I have checked the code, and it seems that the function _memory_check() doesn't produce the correct result when the maximum tree depth is very large, due to numerical errors. This leads to the out-of-memory issue when setting algorithm="v2" or "auto" (_memory_check() is supposed to detect the out-of-memory condition for algorithm v2 and automatically switch to algorithm v1). I have fixed this issue in the latest commit https://github.com/linkedin/FastTreeSHAP/commit/fa8531502553ad5d3e3dfb9dce97a86acad41b1c.
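Purely as an illustration of the kind of numerical problem involved (not the actual _memory_check() code): an estimate that scales like 2**max_depth can silently go wrong for very deep trees, e.g. when evaluated with fixed-width integers.

import numpy as np

max_depth = 100
bytes_per_leaf = 8  # hypothetical per-leaf cost, just for illustration

# With a fixed-width integer, 2**100 overflows int64 and wraps to 0, so a naive
# estimate would look harmless and the check would let algorithm v2 proceed.
print(np.int64(2) ** max_depth * bytes_per_leaf)   # 0

# With arbitrary-precision Python ints, the true scale is astronomically large,
# so the check should fall back to algorithm v1 instead.
print(2 ** max_depth * bytes_per_leaf)             # ~1.0e31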

Let me know if you still have the out-of-memory issue. Thanks!

ShaunFChen commented 2 years ago

Hi Jilei,

Thanks for your prompt reply. The correction is elegant and the process now runs as expected. Besides the example with max_depth=100, some of our trials used to hit the segmentation fault even with a very low maximum tree depth when paired with certain other hyperparameters. Under those conditions, forcing algorithm="v2" now prints There may exist memory issue for algorithm v2. Switched to algorithm v1. instead of segfaulting. Although using v1 in my case still occasionally triggers the unsolved check_additivity issue from the original shap, I believe that belongs in a separate issue if needed - so we can close this one. Thank you so much!!

P.S. Just wondering whether the check_additivity issue could possibly be fixed in the fasttreeshap algorithm? FYI, in a few open threads (https://github.com/slundberg/shap/issues/1071, https://github.com/slundberg/shap/issues/1986, https://github.com/slundberg/shap/issues/941) people observed that:

There seems to be a clear connection with sample size, so it could be an accumulation of rounding errors meeting a max abs diff assertion with a hard-coded limit...

However, following the shap author's suggestion in https://github.com/slundberg/shap/issues/1238 to set check_additivity=False actually gave me irrelevant results on XGBoost. It's fine if fixing the algorithm fundamentally is out of scope for fasttreeshap, but it would be great to know whether there's an easy way to bypass it, since in practice we already observe a much lower chance of fasttreeshap hitting that error than shap.
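As a side note, a small sketch of how the discrepancy itself can be inspected (not part of either library's API, just using the additivity definition): expected_value plus the per-sample sum of SHAP values should match the model's raw prediction, so the gap can be printed directly before deciding whether check_additivity=False is acceptable.

import numpy as np

shap_explainer = fasttreeshap.TreeExplainer(xgb_model)
shap_values = shap_explainer.shap_values(train, check_additivity=False)

pred = xgb_model.predict(train, output_margin=True)              # raw margin output
recon = shap_explainer.expected_value + shap_values.sum(axis=1)  # SHAP reconstruction
print("max abs additivity gap:", np.max(np.abs(recon - pred)))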

jlyang1990 commented 2 years ago

Hi Shaun,

Glad to hear that the segmentation fault issue has been fixed.

Regarding your second note, I actually had a relevant discussion with shap's author a few weeks ago. He pointed out that there can be numerical precision issues with very deep trees (50+), as mentioned in your notes, and that fasttreeshap may improve computational stability either with algorithm="v1", since v1 avoids some redundant computations, or with algorithm="v2", since v2 can do its precomputation at higher precision if needed. So at the current stage, I would suggest using v1 or v2 (if there are no out-of-memory issues) to improve the chance of avoiding the numerical precision issues.

Based on our conversation, I believe shap's author is actively looking into this issue, and I will also look into it when I have time (however, the main purpose of fasttreeshap is to develop a fast implementation that reproduces shap's results, so I will let shap's author drive this effort).

Thanks again for all your detailed feedback and comments.

jlyang1990 commented 2 years ago

BTW, just curious: when you mentioned "... in practice we already observe a much lower chance of fasttreeshap hitting that error than shap", do you have some (estimated) numbers for this observation, e.g., shap produces errors 4 out of 10 times while fasttreeshap produces errors 2 out of 10 times? Thanks!

ShaunFChen commented 2 years ago

Hi Jilei,

Thanks for your explanation of the numerical precision issue; it makes a lot of sense!! My study actually uses shap as part of a feature selector wrapped in an optuna optimization (100 trials). The pipeline is applied to train separate estimators predicting different continuous measurements (~50 traits). Some of them would fail due to the precision error or segmentation fault with fasttreeshap. In fact, most of the traits couldn't complete without a precision error using the original shap.

Currently I'm running the distributed jobs with the latest commit, but it will take a few days to complete. I'll keep you posted once it's done.

jlyang1990 commented 2 years ago

Thanks Shaun for your information! Looking forward to your updates :)

ShaunFChen commented 2 years ago

Hi Jilei,

Thanks for your patience. After applying the latest commit, the segmentation fault issue was solved. To standardize and summarize the observations, I used the same conditions described above (same machine, framework, and dataset) to run a simple grid search iterating over XGBoost parameter value combinations:

d = {"lambda": [1e-8, 1],
     "alpha": [1e-8, 1],
     "subsample": [0.5 ,1],
     "colsample_bytree": [0.5, 1],
     "scale_pos_weight": [0.5, 20],
     "max_depth": [6, 100],
     "min_child_weight": [1, 10],
     "eta": [1e-8, 1],
     "gamma": [1e-8, 1],
     "grow_policy": ["depthwise", "lossguide"]
    }
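
For concreteness, a rough sketch of how such a sweep can be driven (illustrative only; run_one() just stands in for the train-then-explain code from my first comment with the given parameters):

import itertools

def run_one(params, algorithm):
    model = xgb.XGBRegressor(n_estimators=200, random_state=0, n_jobs=-1, **params)
    model.fit(train, label_train)
    explainer = fasttreeshap.TreeExplainer(model, algorithm=algorithm)
    return explainer.shap_values(train)

keys = list(d.keys())
for values in itertools.product(*d.values()):   # 2**10 = 1024 combinations
    params = dict(zip(keys, values))
    try:
        run_one(params, algorithm="v2")
    except Exception as e:                       # precision / additivity errors land here
        print(params, type(e).__name__, e)
    # note: a segmentation fault kills the whole Python process, so it cannot be
    # caught here; those combinations show up as crashed runs instead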
Among the 1024 combinations, here are the numbers of errors that occurred with the original shap and with fasttreeshap under the different algorithms:

|  | shap | fasttreeshap_v0 | fasttreeshap_v1 | fasttreeshap_v2_old | fasttreeshap_v2_new |
| --- | --- | --- | --- | --- | --- |
| segmentation fault | 0 | 0 | 0 | 227 | 0 |
| precision issue | 256 | 64 | 62 | 0 | 62 |

After fixing _memory_check() in the latest commit, fasttreeshap with algorithm="v2" (or "auto") now redirects to the proper algorithm and avoids the out-of-memory condition. The advantage of fasttreeshap was also reproduced in this example (which is an amazing lifesaver!! 🚀) - the combination of max_depth=100 and eta=1 was indeed the major cause of the precision issue in both shap and fasttreeshap.

Although a portion of the conditions still fail with the precision issue under fasttreeshap, it's mostly due to the low predictability of our target trait leading to extreme parameters during tuning. In practice, we then decided to apply approximate=True for the large dataset, since the Saabas algorithm returned a similar importance ranking; we then select the top 50 features and rerun with approximate=False for a more accurate importance explanation. It's been running well so far and has been propagated to other traits in different pipelines. Thank you so much for all the help!! - and please feel free to correct me if there's any potential bias in the tests above.
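
For reference, a rough sketch of that two-stage pattern (simplified from our actual pipeline; the stage-2 refit and the variable names are just illustrative):

import numpy as np

explainer = fasttreeshap.TreeExplainer(xgb_model)

# Stage 1: fast approximate (Saabas-style) SHAP values on the full feature set,
# used only to rank features by mean |SHAP|.
shap_approx = explainer.shap_values(train, approximate=True)
ranking = np.argsort(np.abs(shap_approx).mean(axis=0))[::-1]
top_features = train.columns[ranking[:50]]

# Stage 2: refit on the top 50 features and compute the exact SHAP values there.
xgb_top = xgb.XGBRegressor(max_depth=6, n_estimators=200, random_state=0, n_jobs=-1)
xgb_top.fit(train[top_features], label_train)
shap_exact = fasttreeshap.TreeExplainer(xgb_top).shap_values(train[top_features], approximate=False)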

jlyang1990 commented 2 years ago

Thanks so much, Shaun, for the detailed description of your experiment settings and the table of quantitative results! Really happy to see that fasttreeshap has helped mitigate the numerical precision issues in your project. Let me know if there is anything else I can help with, and good luck with your project! :)