ShaunFChen closed this issue 2 years ago.
Hi Shaun,
Thanks for pointing out this issue. I have checked the code, and it seems that the function `_memory_check()` doesn't produce the correct result when the maximum tree depth is very large, due to numerical errors. This leads to the out-of-memory issue when setting `algorithm="v2"` or `"auto"` (`_memory_check()` should detect the out-of-memory risk for algorithm v2 and automatically switch to algorithm v1). I have fixed this issue in the latest commit: https://github.com/linkedin/FastTreeSHAP/commit/fa8531502553ad5d3e3dfb9dce97a86acad41b1c.
Let me know if you still have the out-of-memory issue. Thanks!
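For context, here is a minimal illustration of the failure class described above. This is NOT FastTreeSHAP's actual code, just a sketch of how a memory estimate proportional to `2**max_depth` can silently go wrong in fixed-width arithmetic:

```python
import numpy as np

# Hypothetical sketch, not FastTreeSHAP's implementation: a memory estimate
# that scales like 2**max_depth wraps around in a 64-bit integer, so a naive
# check can report "enough memory" for very deep trees and let v2 run anyway.
max_depth = 100

exact_leaves = 2 ** max_depth        # Python ints are arbitrary precision
wrapped = np.int64(2) ** max_depth   # int64 arithmetic wraps modulo 2**64

print(exact_leaves > 2**63 - 1)      # True: far beyond the int64 range
print(wrapped)                       # the overflowed estimate looks harmless
```

An estimate that wraps to a tiny value passes the check, and the allocation then fails at runtime instead.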
Hi Jilei,
Thanks for your prompt reply. The elegant correction is awesome, and the process now runs as expected. Besides the example with `max_depth=100`, some of our trials used to hit a segmentation fault with a very low maximum tree depth when paired with other hyperparameters. Under those conditions, forcing `algorithm="v2"` now raises `There may exist memory issue for algorithm v2. Switched to algorithm v1.` instead of leading to a segmentation fault. Although using v1 in my case is still imperfect, since it hits an unsolved `check_additivity` issue in the original `shap`, I believe that should go into a separate issue if needed, and we can close this post. Thank you so much!!
P.S. Just wondering whether the `check_additivity` issue could possibly be fixed in the `fasttreeshap` algorithm? FYI, in a few open threads (https://github.com/slundberg/shap/issues/1071, https://github.com/slundberg/shap/issues/1986, https://github.com/slundberg/shap/issues/941) people observed that:

> There seems to be a clear connection with sample size, so it could be an accumulation of rounding errors meeting a max abs diff assertion with a hard-coded limit...

However, as in https://github.com/slundberg/shap/issues/1238, following the `shap` author's suggestion to set `check_additivity=False` actually gave me irrelevant results on XGBoost. It's fine if fixing the algorithm fundamentally is out of scope for `fasttreeshap`, but it would be great to know if there's an easy way to bypass it, since in practice we already observed a much lower chance of hitting that error with `fasttreeshap` than with `shap`.
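For reference, the assertion those threads describe can be sketched as follows. This is a simplified illustration with synthetic numbers, not `shap`'s actual implementation:

```python
import numpy as np

# Simplified sketch of an additivity check (assumption: it asserts that
# expected_value + the per-feature SHAP values sum to the model output
# within a hard-coded tolerance, so accumulated rounding noise can trip it,
# and more samples mean more chances for one row to exceed the limit).
rng = np.random.default_rng(0)
shap_values = rng.normal(size=(1000, 20))   # stand-in for explainer output
expected_value = 0.5
model_output = expected_value + shap_values.sum(axis=1) + 1e-8  # rounding noise

max_abs_diff = np.abs(expected_value + shap_values.sum(axis=1) - model_output).max()
passes = max_abs_diff < 1e-2                # hypothetical hard-coded limit
print(passes)
```

With deep trees, the per-row rounding error can grow past such a limit, which matches the sample-size connection noted in the threads.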
Hi Shaun,
Glad to hear that the segmentation fault issue has been fixed.
Regarding your second note, I actually had a relevant discussion with `shap`'s author a few weeks ago. He pointed out that there can be numerical precision issues with very deep trees (50+), as mentioned in your notes, and that `fasttreeshap` might be able to improve computational stability by using `algorithm="v1"`, since v1 avoids some redundant computations, or by using `algorithm="v2"`, since v2 can conduct precomputation at a higher precision if needed. So at the current stage, I would suggest using v1, or v2 if there are no out-of-memory issues, to increase the chance of avoiding the numerical precision issues.
I believe `shap`'s author is actively looking into this issue based on our conversation, and I will also look into it when I have time (however, the main purpose of `fasttreeshap` is to develop a fast implementation that reproduces the results from `shap`, so I will let `shap`'s author drive this effort).
Thanks again for all your detailed feedback and comments.
BTW, just curious: when you mentioned that in practice you already observed a much lower chance for `fasttreeshap` to hit that error than `shap`, do you have some (estimated) numbers on this observation, e.g., `shap` produces errors 4 out of 10 times while `fasttreeshap` produces errors 2 out of 10 times? Thanks!
Hi Jilei,
Thanks for your explanation of the numerical precision issue; it makes a lot of sense!! My study actually uses SHAP as part of a feature selector wrapped in `optuna` optimization (100 trials). The pipeline was applied to train separate estimators predicting different continuous measurements (~50 traits). Some of them would fail due to the precision error or a segmentation fault with `fasttreeshap`. Actually, most of the traits couldn't complete without a precision error using the original `shap`.
Currently I'm running the jobs distributed with the latest commit, but it will take a few days to complete. I'll keep you posted once it's done.
Thanks Shaun for your information! Looking forward to your updates :)
Hi Jilei,
Thanks for your patience. After applying the latest commit, the segmentation fault issue was solved. To standardize and summarize the observations, I used the same conditions described above (same machine, framework, and dataset) to run a simple grid search iterating over combinations of XGBoost parameter values:
```python
d = {"lambda": [1e-8, 1],
     "alpha": [1e-8, 1],
     "subsample": [0.5, 1],
     "colsample_bytree": [0.5, 1],
     "scale_pos_weight": [0.5, 20],
     "max_depth": [6, 100],
     "min_child_weight": [1, 10],
     "eta": [1e-8, 1],
     "gamma": [1e-8, 1],
     "grow_policy": ["depthwise", "lossguide"]
     }
```
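(For reference, the 2^10 = 1024 combinations from this grid can be enumerated with standard Python; this is a sketch of the search loop with the model training and SHAP call elided:)

```python
import itertools

# The grid above: 10 parameters with 2 candidate values each.
d = {"lambda": [1e-8, 1], "alpha": [1e-8, 1], "subsample": [0.5, 1],
     "colsample_bytree": [0.5, 1], "scale_pos_weight": [0.5, 20],
     "max_depth": [6, 100], "min_child_weight": [1, 10],
     "eta": [1e-8, 1], "gamma": [1e-8, 1],
     "grow_policy": ["depthwise", "lossguide"]}

keys = list(d)
combos = [dict(zip(keys, vals)) for vals in itertools.product(*d.values())]
print(len(combos))  # 2**10 = 1024
# for params in combos:
#     train an XGBoost model with params, then call explainer.shap_values(X)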
Among the 1024 combinations, here is the number of errors that occurred with the original shap or fasttreeshap under each algorithm:

| | shap | fasttreeshap_v0 | fasttreeshap_v1 | fasttreeshap_v2_old | fasttreeshap_v2_new |
|---|---|---|---|---|---|
| segmentation fault | 0 | 0 | 0 | 227 | 0 |
| precision issue | 256 | 64 | 62 | 0 | 62 |
After fixing `_memory_check()` in the latest commit, `fasttreeshap` with `algorithm="v2"` (or `"auto"`) now redirects to the proper algorithm and avoids running out of memory. The advantage of `fasttreeshap` was also reproduced in this example (which is an amazing lifesaver!! 🚀). `max_depth=100` together with `eta=1` was indeed the major cause of the precision issue in both `shap` and `fasttreeshap`.
Although a portion of the conditions still failed with the precision issue under `fasttreeshap`, this is mostly due to the low predictability of our target trait leading to extreme parameters during tuning. In practice, we then decided to apply `approximate=True` for the large dataset, since it returned an importance ranking similar to the Saabas algorithm, then select the top 50 features and run with `approximate=False` again for a more accurate importance explanation. It's been running well so far and has been propagated to other traits in different pipelines. Thank you so much for all the help!! And please feel free to correct me if there's any potential bias in the tests above.
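(A minimal sketch of that two-stage selection, with synthetic numbers standing in for the real explainer output; the commented `explainer.shap_values(...)` calls are assumptions based on the workflow described above:)

```python
import numpy as np

# Stage 1 stand-in: SHAP values from a fast approximate pass
# (random numbers replace explainer.shap_values(X, approximate=True)).
rng = np.random.default_rng(0)
approx_shap = rng.normal(size=(5000, 200))   # samples x features

# Rank features by mean absolute SHAP value and keep the top 50.
mean_abs = np.abs(approx_shap).mean(axis=0)
top50 = np.argsort(mean_abs)[::-1][:50]

# Stage 2: re-run the exact explainer on the reduced feature set, e.g.
# explainer.shap_values(X[:, top50], approximate=False)
print(top50.shape)
```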
Thanks so much, Shaun, for the detailed description of your experiment settings and for the table of quantitative results! Really happy to see that `fasttreeshap` has helped mitigate the numerical precision issues in your project. Let me know if there is anything else I can help with, and good luck with your project! :)
Hi,
I'm trying to apply the TreeExplainer to get `shap_values` from an XGBoost model on a regression problem with a large dataset. During hyperparameter tuning, it failed due to a segmentation fault at the `explainer.shap_values()` step for certain hyperparameter sets. I used fasttreeshap=0.1.1 and xgboost=1.4.1 (also tested 1.6.0), on a machine with CPU "Intel Xeon E5-2640 v4 (20) @ 3.400GHz" and 128GB of memory. The sample code below is a toy script to reproduce the issue using the Superconductor dataset from the example notebook.
The `time` report of the program execution also showed that the "Maximum resident set size" was only about 32GB.
In some cases (including the example above), forcing `TreeExplainer(algorithm="v1")` did help, which means the issue could only happen with "v2" (or "auto" passing the `_memory_check()`). However, by chance v1 would raise another `check_additivity` issue, which remains unsolved in the original algorithm.
Alternatively, passing `approximate=True` to `explainer.shap_values()` would work, but it raises consistency concerns for the reproducibility of our studies...
In this case, could you help me debug this issue?
Thank you so much!