exalate-issue-sync[bot] opened 1 year ago
Michal Kurka commented: Hi [~accountid:5df225868ef01d0e539b70a4] , thank you for the bug report and the code to reproduce. I will look into the issue.
Mike Fleuren commented: Thank you @Michal Kurka! I had the impression that the difference between the normal prediction and predict_contributions() increases with the number of levels in a factor. However, the example code isn't currently set up to mimic this behaviour.
JIRA Issue Migration Info
Jira Issue: PUBDEV-7516
Assignee: Michal Kurka
Reporter: Mike Fleuren
State: In Progress
Fix Version: Backlog
Attachments: N/A
Development PRs: N/A
I'm working on a script that is supposed to explain the results of an XGBoost ML model. The model contains, among others, one feature with a large number of categorical levels (approx. 2000). I noticed that when I sum the contributions generated by .predict_contributions(), the outcome *largely differs* from the actual predictions of my model (i.e. .predict()).
According to your website, "the sum of the feature contributions and the bias term is equal to the raw prediction of the model. Raw prediction of tree-based model is the sum of the predictions of the individual trees before the inverse link function is applied to get the actual prediction. For Gaussian distribution, the sum of the contributions is equal to the model prediction." ([ref](https://docs.h2o.ai/h2o/latest-stable/h2o-docs/performance-and-prediction.html#predict-contributions)) I did not specify the distribution of the model, and the default for regression is Gaussian. ([ref](http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/xgboost.html)) Thus, the sum of the feature contributions + bias term should be equal to the model prediction.
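For reference, the identity the docs describe can be illustrated with a toy example (plain NumPy, no H2O; the contribution values and bias below are made up): for a Gaussian-distribution tree model with an identity link, the raw prediction is simply the row-wise sum of the per-feature contributions plus the bias term.

```python
import numpy as np

# Hypothetical per-feature contributions for 3 rows and 4 features,
# as predict_contributions() would return (minus the BiasTerm column)
contribs = np.array([
    [ 0.5, -1.2, 0.3, 2.0],
    [ 0.1,  0.4, -0.6, 1.1],
    [-0.9,  0.2, 0.7, 0.0],
])
bias = 3.25  # hypothetical BiasTerm value

# For a Gaussian distribution (identity link), the prediction should equal
# the sum of the contributions plus the bias -- the identity that this
# bug report shows being violated for high-cardinality factors.
raw_pred = contribs.sum(axis=1) + bias
print(raw_pred)  # -> [4.85 4.25 3.25]
```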
I've attached a small piece of stand-alone example code that mimics the behaviour I'm observing, but to a lesser extent in magnitude (i.e. the differences between predict() and predict_contributions().sum() are smaller).
Python version: 3.6.9
H2O version: 3.30.0.1/3.30.0.2
{code}
import pandas as pd
import h2o
from h2o.estimators.xgboost import H2OXGBoostEstimator

h2o.init()

# Create dataset and model that estimates a value in the thousands-range
prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
prostate['CATGROUP1'] = (prostate['AGE']**2).asfactor()
prostate['AGE'] = (prostate['AGE']**(1/2)).exp()  # To bring prediction values to order of magnitude of interest
prostate['CAPSULE'] = prostate['CAPSULE'].asfactor()
prostate['RACE'] = prostate['RACE'].asfactor()
prostate['DCAPS'] = prostate['DCAPS'].asfactor()
prostate['DPROS'] = prostate['DPROS'].asfactor()
response_col = 'AGE'
prostate_xgb = H2OXGBoostEstimator(seed=1234, ntrees=5, max_depth=50)
prostate_xgb.train(x=list(range(3, prostate.shape[1])), y=response_col, training_frame=prostate)

# Predict outcome & predict contributions
prostate_pred = prostate_xgb.predict(prostate.head(10)).as_data_frame()
contrib = prostate_xgb.predict_contributions(prostate.head(10)).as_data_frame()

# Compare summed contributions and actual predictions
diffs = pd.DataFrame([contrib.sum(axis=1), prostate_pred['predict']],
                     index=['summedContributions', 'prediction']).T
diffs['difference'] = diffs['summedContributions'] - diffs['prediction']
diffs['diffFraction'] = diffs['difference'] / diffs['prediction']
diffs
{code}