guitargeek / XGBoost-FastForest

Minimal library code to deploy XGBoost models in C++.
MIT License

Results of binary logistic regression mismatched with xgboost #20

Closed andriiknu closed 1 year ago

andriiknu commented 1 year ago

Hello,

I'm currently working on a project that involves using FastForest, and as part of my validation process, I've been comparing inference results between XGBoost and FastForest using a single vector. However, I've come across an unexpected issue that I'm seeking assistance with.

During my experiment, I noticed that when I use the 'binary:logistic' objective in XGBoost, the predicted values differ from those obtained using FastForest. Strangely, when I switch to the 'binary:logitraw' objective in XGBoost, the predicted scores align with those from FastForest.

I suspect that the difference might be due to distinct logistic transformations applied in XGBoost and FastForest. To address this, I've tried exploring XGBoost's documentation for details about the logistic transformation applied with the 'binary:logistic' objective. Unfortunately, I couldn't find the specific information I was looking for.

In the example provided in the Readme for FastForest, a sigmoid transformation is explicitly applied to the score obtained from the model. Based on this, I assumed that XGBoost also applies a sigmoid transformation for the 'binary:logistic' objective. Is that correct?

The code for reproduction is the following:

1. Train, infer, and save the XGBoost model:

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])

# switch the objective to "binary:logitraw" to match the FastForest results
model = XGBClassifier(objective='binary:logistic').fit(X, y)
booster = model._Booster

print(model.predict_proba(np.array([[0.0, 0.2, 0.4, 0.6, 0.8]])))  # [[0.37146312 0.6285369 ]]

booster.dump_model("model.txt")
```

2. Load the model into FastForest and perform inference:
```cpp
#include "fastforest.h"
#include <iostream>
#include <cmath>

int main() {
    std::vector<std::string> features{"f0",  "f1",  "f2",  "f3",  "f4"};

    const auto fastForest = fastforest::load_txt("model.txt", features);

    std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};

    float score = fastForest(input.data()); // 1.02595
    float sigmoid = 1./(1. + std::exp(-score));

    std::cout << "sigmoid: " << sigmoid << std::endl; // 0.736129
}
```

I'm interested in understanding the mismatch between the XGBoost and FastForest results when using the 'binary:logistic' objective. I'm eager to get to the bottom of this issue and would greatly appreciate any help.

guitargeek commented 1 year ago

Hi! I can tell you exactly how to recreate the output values of xgboost 1.7.6:

```python
from xgboost import XGBClassifier
from sklearn.datasets import make_classification
import numpy as np

X, y = make_classification(n_samples=10000, n_features=5, random_state=42, n_classes=2, weights=[0.5])

model_raw = XGBClassifier(objective='binary:logitraw', n_estimators=1, max_depth=1).fit(X, y)
model = XGBClassifier(objective='binary:logistic', n_estimators=1, max_depth=1).fit(X, y)

x = np.array([[0.0, 0.2, 0.4, 0.6, 0.8]])

print("predict (logitraw): ", model_raw.predict(x, output_margin=True)[0])
print("predict (logistic): ", model.predict(x, output_margin=True)[0])
print("predict_proba (logitraw): ", model_raw.predict_proba(x)[0,1])
print("predict_proba (logistic): ", model.predict_proba(x)[0,1])

# Since the fit datasets and evaluation inputs were the same, I would expect
# that `predict_proba` gives the same result in both cases, but it doesn't.
# What does predict_proba even mean for logitraw? In the documentation, it
# says that logitraw is unsupported by predict_proba():
# https://github.com/dmlc/xgboost/blob/master/python-package/xgboost/sklearn.py#L1617
# So why do we get some value that apparently is not equal to the predicted
# probability, and not an error message?

# Also, I would expect that predict(x, output_margin=True) gives the same
# result, as it is the case in this post:
# https://stackoverflow.com/questions/71240580/binarylogitraw-vs-binarylogistic-objective-in-xgboost
# But for us, this is not the case. Maybe there is a new bug in the new XGBoost
# version I have (1.7.6)?

booster_raw = model_raw._Booster
booster_raw.dump_model("model_raw.txt")

booster = model._Booster
booster.dump_model("model.txt")
```

Output:

```
predict (logitraw):  0.8199206
predict (logistic):  0.44758207
predict_proba (logitraw):  0.8199206
predict_proba (logistic):  0.61006415
```
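
As a quick sanity check (reusing `model`, `x`, and `np` from the snippet above), the logistic probability is just the sigmoid of the logistic margin:

```python
# For binary:logistic, predict_proba is the sigmoid of the raw margin:
# sigmoid(0.44758207) is about 0.61006, matching predict_proba (logistic) above.
margin = model.predict(x, output_margin=True)[0]
print(1.0 / (1.0 + np.exp(-margin)))
```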

Here is the C++ code:

```cpp
// compile with g++ -o <filename> <filename>.cpp -lfastforest

#include "fastforest.h"
#include <iostream>
#include <cmath>

int main() {
    std::vector<std::string> features{"f0",  "f1",  "f2",  "f3",  "f4"};

    const auto fastForestRaw = fastforest::load_txt("model_raw.txt", features);
    const auto fastForest = fastforest::load_txt("model.txt", features);

    std::vector<float> input{0.0, 0.2, 0.4, 0.6, 0.8};

    // The baseResponse is the value that is added to the tree output.
    // Apparently, xgboost uses different base responses for the two
    // objectives (note that 0.5 is the default in fastforest, which should
    // maybe get changed to 0.0).
    float scoreRaw = fastForestRaw(input.data(), /*baseResponse=*/0.5);
    float score = fastForest(input.data(), /*baseResponse=*/0.0);
    float sigmoid = 1./(1. + std::exp(-score));

    std::cout << "predict (logitraw): " << scoreRaw << std::endl;
    std::cout << "predict (logistic): " << score << std::endl;
    std::cout << "predict_proba (logitraw): " << scoreRaw << std::endl;
    std::cout << "predict_proba (logistic): " << sigmoid << std::endl;
}
```

Output:

```
predict (logitraw): 0.819921
predict (logistic): 0.447582
predict_proba (logitraw): 0.819921
predict_proba (logistic): 0.610064
```

The remaining differences are because fastforest uses single precision and not double precision, for performance reasons.
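
Regarding the base response mentioned in the C++ comments: if you want to see which base score xgboost actually stored for each objective, something like this should work (reusing `booster_raw` and `booster` from the Python snippet above; the exact key layout may differ between xgboost versions):

```python
import json

# Print the stored base score of each booster.
for name, b in [("logitraw", booster_raw), ("logistic", booster)]:
    cfg = json.loads(b.save_config())
    print(name, cfg["learner"]["learner_model_param"]["base_score"])
```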

As explained in the Python code comments, I have no clue what xgboost is actually doing with this binary:logitraw. The output changes every few xgboost versions. I would expect that you get the same probabilities as with binary:logistic, but this is not the case. Maybe this is worth opening an issue in https://github.com/dmlc/xgboost?

Anyway, is your question answered for the fastforest side with what I wrote?

Cheers! Jonas

andriiknu commented 1 year ago

Thanks for getting back to me quickly! You've given me a very helpful explanation. The scores now align after setting baseResponse to 0.5 for the fastForest inference.
Yes, my question about the fastforest side has been answered now. I'm very grateful for your help!

andriiknu commented 1 year ago

Another thing that might be worth mentioning (though it's not related to this post) is that in my ROOT RDataFrame application I need to use `float data[4] = {1, 2, 3, 4};` instead of `auto data = std::vector({1, 2, 3, 4}).data()`. Using a std::vector inside ROOT's RDataFrame leads to unpredictable behavior. If this is unexpected to you, I can open a separate issue.

guitargeek commented 1 year ago

Watch out, `auto data = std::vector({1,2,3,4}).data()` is unexpected behavior on its own! The data() method gives you a pointer to the beginning of the vector, but the temporary vector only lives until the end of that statement, so the pointer is dangling afterwards. This would work:

```cpp
std::vector<float> vec{1., 2., 3., 4.};
auto data = vec.data(); // at this point why even give this a name, just use `vec.data()` where needed
```
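
If it helps, here is a rough, untested sketch of how this could look inside an RDataFrame Define (the tree name, file name, and float branches f0..f4 are just placeholders for your setup):

```cpp
#include "fastforest.h"
#include <ROOT/RDataFrame.hxx>

#include <string>
#include <vector>

int main() {
    std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4"};
    const auto fastForest = fastforest::load_txt("model.txt", features);

    ROOT::RDataFrame df("Events", "input.root");  // placeholder tree/file names
    auto scored = df.Define(
        "score",
        [&fastForest](float f0, float f1, float f2, float f3, float f4) {
            // The vector lives for the whole lambda body, so the pointer
            // returned by data() stays valid while fastForest reads it.
            std::vector<float> in{f0, f1, f2, f3, f4};
            return fastForest(in.data());
        },
        {"f0", "f1", "f2", "f3", "f4"});
    scored.Snapshot("Events", "scored.root");
}
```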

Could this be the solution to your problem?

andriiknu commented 1 year ago

Ah, I see, good catch! Thank you! Yes, this is the solution.

guitargeek commented 1 year ago

Cool! Then I'll close this issue. Feel free to ask about fastforest also on Mattermost, if you have further questions!