exalate-issue-sync[bot] opened 1 year ago
Michal Kurka commented: Hi [~accountid:5df225868ef01d0e539b70a4] , thank you for the bug report and the code to reproduce. I will look into the issue.
Mike Fleuren commented: Thank you @Michal Kurka! I had the impression that the difference between the normal prediction and predict_contributions() increases with the number of levels in a factor. However, the example code isn't currently set up to mimic this behaviour.
JIRA Issue Migration Info
Jira Issue: PUBDEV-7516
Assignee: Michal Kurka
Reporter: Mike Fleuren
State: In Progress
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A
I'm working on a script that is supposed to explain the results of an XGBoost ML model. The model contains, among others, one feature with a large number of categorical levels (approx. 2000). I noticed that when I sum the contributions generated by .predict_contributions(), the outcome *largely differs* from the actual predictions of my model (i.e. .predict()).
According to your website, "the sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution, the sum of the contributions is equal to the model prediction." ([ref](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#predict-contributions)) I did not specify the distribution of the model, and the default for regression is Gaussian. ([ref](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html)) Thus, the sum of the feature contributions + bias term should be equal to the model prediction.
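For reference, the identity the docs describe can be illustrated with a toy example (plain NumPy, no H2O; the contribution values and bias below are made up): for a Gaussian-distribution tree model with an identity link, the raw prediction is simply the row-wise sum of the per-feature contributions plus the bias term.

```python
import numpy as np

# Hypothetical per-feature contributions for 3 rows and 4 features,
# as predict_contributions() would return (minus the BiasTerm column)
contribs = np.array([
    [ 0.5, -1.2, 0.3, 2.0],
    [ 0.1,  0.4, -0.6, 1.1],
    [-0.9,  0.2, 0.7, 0.0],
])
bias = 3.25  # hypothetical BiasTerm value

# For a Gaussian distribution (identity link), the prediction should equal
# the sum of the contributions plus the bias -- the identity that this
# bug report shows being violated for high-cardinality factors.
raw_pred = contribs.sum(axis=1) + bias
print(raw_pred)  # -> [4.85 4.25 3.25]
```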
I've attached a small piece of stand-alone example code that mimics the behaviour I'm observing, but to a lesser extent in magnitude (i.e. the differences between predict() and predict_contributions().sum() are smaller).
Python version: 3.6.9
H2O version: 3.30.0.1/3.30.0.2
{code}
import pandas as pd
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()

# Create dataset and model that estimates a value in the thousands-range
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
prostate['CATGROUP1'] = (prostate['AGE']**2).asfactor()
prostate['AGE'] = (prostate['AGE']**(1/2)).exp()  # To bring prediction values to order of magnitude of interest
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()
response_col = 'AGE'
prostate_xgb = H2OXGBoostEstimator(seed=1234, ntrees=5, max_depth=50)
prostate_xgb.train(x=list(range(3, prostate.shape[1])), y=response_col, training_frame=prostate)

# Predict outcome & predict contributions
prostate_pred = prostate_xgb.predict(prostate.head(10)).as_data_frame()
contrib = prostate_xgb.predict_contributions(prostate.head(10)).as_data_frame()

# Compare summed contributions and actual predictions
diffs = pd.DataFrame([contrib.sum(axis=1), prostate_pred['predict']],
                     index=['summedContributions', 'prediction']).T
diffs['difference'] = diffs['summedContributions'] - diffs['prediction']
diffs['diffFraction'] = diffs['difference'] / diffs['prediction']
diffs
{code}