microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/

[python-package] `LGBMRegressor.predict(...,pred_contrib=True)` does not average contribution from individual trees when `boosting_type='rf'` #6217

Open trendelkampschroer opened 11 months ago

trendelkampschroer commented 11 months ago

Description

For a random forest model (`boosting_type='rf'`) the per-feature contributions returned by `pred_contrib=True` are not averaged across the individual trees.

Below you can see that the contributions (plus the expected value) sum to the raw prediction, i.e. the sum of the predictions of the individual trees in the random forest, rather than to their average, which is what `predict()` returns.
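
Stated as an equation (my notation, not taken from the LightGBM docs): with `T` trees `f_1, ..., f_T`, `predict()` returns the average of the tree outputs, so the per-feature contributions `phi_j` plus the expected value `phi_0` from `pred_contrib=True` should satisfy the first identity below, but in `rf` mode they currently satisfy the second:

$$\phi_0(x) + \sum_{j=1}^{p}\phi_j(x) = \frac{1}{T}\sum_{t=1}^{T} f_t(x) \quad\text{(expected)} \qquad\qquad \phi_0(x) + \sum_{j=1}^{p}\phi_j(x) = \sum_{t=1}^{T} f_t(x) \quad\text{(observed)}$$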

Reproducible example

import lightgbm
import sklearn.datasets

n_samples = 1000
X, y = sklearn.datasets.make_regression(n_samples=n_samples, n_features=3, random_state=42)
model = lightgbm.LGBMRegressor(boosting_type="rf", n_estimators=10, colsample_bytree=0.5)
model.fit(X, y)
>>> [LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000308 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 765
[LightGBM] [Info] Number of data points in the train set: 1000, number of used features: 3
[LightGBM] [Info] Start training from score 5.136342

X = X[0:3, :]
y_hat = model.predict(X)
z_hat = model.predict(X, raw_score=True)
phi = model.predict(X, pred_contrib=True)
print(f"Prediction {y_hat=}")
>>> Prediction y_hat=array([-113.44588556,  -86.95479007,  124.66706467])
print(f"Raw prediction {z_hat=}")
>>> Raw prediction z_hat=array([-1134.45885557,  -869.54790071,  1246.67064666])
print(f"Sum of SHAP values and expectation {phi.sum(axis=1)}")
>>> Sum of SHAP values and expectation [-1134.45885557  -869.54790071  1246.67064666]
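
As a sanity check (a sketch of a possible workaround, not an official one): dividing the returned contribution matrix by the number of trees restores the expected additivity against `predict()`. This assumes every tree in `rf` mode carries equal weight and that `model.n_estimators` equals the number of fitted trees, which is the case in the example above.

import numpy as np

# Rescale the contributions (including the expected-value column) so that
# they correspond to the averaged prediction rather than the summed one.
phi_avg = phi / model.n_estimators

# Additivity now holds against predict(): each row of phi_avg sums to y_hat.
print(np.allclose(phi_avg.sum(axis=1), y_hat))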

Environment info

LightGBM version or commit hash:

Command(s) you used to install LightGBM

conda install lightgbm~=4.0
trendelkampschroer commented 11 months ago

@jameslamb thanks a lot for updating the issue title and triaging the issue. I don't think this is merely a usage question; it is a bug. Compare e.g. https://github.com/shap/shap/blob/4fa04f89e00b54ac649a86b755873c953c208e3f/shap/explainers/_tree.py#L405 in the SHAP package, where `pred_contrib=True` is used to compute SHAP values: for a random forest model the computed values will be wrong, in the sense that the sum of the expected value and the SHAP values will not equal the prediction.
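
To make the failure mode concrete, here is a minimal additivity check in the spirit of the check that shap's `TreeExplainer` performs (the snippet is my own sketch, not code from the SHAP package), reusing `model` and `X` from the example above:

import numpy as np

# SHAP additivity: per-feature contributions plus the expected value should
# reproduce the output of predict() for every row.
phi = model.predict(X, pred_contrib=True)
print(np.allclose(phi.sum(axis=1), model.predict(X)))
# For this regression setup this prints True with boosting_type='gbdt' but
# False with boosting_type='rf', where the row sums instead match the raw
# score (the sum over the individual trees).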

The documentation at https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.Booster.html#lightgbm.Booster.predict also suggests that I can get the actual SHAP values for a random forest model by passing `pred_contrib=True`.

A possibly related issue is also documented here: https://github.com/shap/shap/issues/669.