microsoft / hummingbird

Hummingbird compiles trained ML models into tensor computation for faster inference.
MIT License

xgboost tweedie loss predictions do not match #690

Open Maggie1216 opened 1 year ago

Maggie1216 commented 1 year ago

Hi team, I'm using hummingbird_ml==0.4.3 and xgboost==1.5.2, and testing predictions from an XGBRegressor trained with the reg:tweedie objective.

import xgboost as xgb
from hummingbird.ml import convert
from sklearn.datasets import load_diabetes

# Train an XGBoost regressor with the Tweedie objective
train_x, train_y = load_diabetes(return_X_y=True)
xgb_tweedie = xgb.XGBRegressor(objective='reg:tweedie', n_estimators=50, tweedie_variance_power=1.8)
xgb_tweedie.fit(train_x, train_y)
print(xgb_tweedie.predict(train_x[:10]))

# Convert to PyTorch with Hummingbird and compare predictions on the same rows
xgb_tweedie_torch = convert(xgb_tweedie, 'pytorch', extra_config={'post_transform': 'TWEEDIE'})
print(xgb_tweedie_torch.predict(train_x[:10]))

It prints:

[160.32375 73.65087 140.53572 208.20435 115.15947 99.853676 125.59772 64.26746 110.12681 298.41394]
[528.6581 242.85928 463.40848 686.5432 379.7321 329.2624 414.15146 211.91847 363.13666 984.0033]

After some analysis (I generated 1000 different regression datasets, also tried different values of tweedie_variance_power, etc.), I found that the xgb_tweedie_torch predictions (after conversion) are always 3.2974 times the xgb_tweedie predictions (before conversion). For example, 160.32375 × 3.2974 = 528.6581. I wonder why this is the case?
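A quick check of the elementwise ratio makes the pattern explicit (a minimal sketch reusing the objects from the script above; numpy is assumed to be installed):

import numpy as np

xgb_pred = xgb_tweedie.predict(train_x[:10])
hb_pred = xgb_tweedie_torch.predict(train_x[:10])
# Every row gives the same constant ratio, ~3.2974
print(np.round(hb_pred / xgb_pred, 4))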

ksaur commented 1 year ago

Hi @Maggie1216, thank you for the detailed example! It's possible that our implementation of Tweedie does not cover some cases. We'll add it to the backlog!

gorkemozkaya commented 12 months ago

The constant 3.2974 happens to be 2 * exp(0.5), and 0.5 is the default base_score in XGBoost models. I suspect this discrepancy is related to how the base_score is handled in transforms.
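One plausible mechanism consistent with this (a sketch, not a confirmed diagnosis of Hummingbird's internals): for reg:tweedie, XGBoost uses a log link, so base_score = 0.5 enters the margin as log(0.5) and the native prediction is exp(margin + log(0.5)) = 0.5 * exp(margin). If the converted model instead added base_score directly in margin space, it would compute exp(margin + 0.5) = exp(0.5) * exp(margin), inflating every prediction by exp(0.5) / 0.5 = 2 * exp(0.5) ≈ 3.2974, exactly the constant observed above:

import math

base_score = 0.5  # XGBoost default

# Native XGBoost: base_score passes through the log link, so prediction = 0.5 * exp(margin).
# Hypothesized converted model: base_score added raw, so prediction = exp(0.5) * exp(margin).
ratio = math.exp(base_score) / base_score
print(ratio, 2 * math.exp(base_score))  # both print 3.2974425414002564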