guitargeek / XGBoost-FastForest

Minimal library code to deploy XGBoost models in C++.
MIT License

XGBoost and FastForest predictions don't match #3

Closed · farjasju closed this issue 4 years ago

farjasju commented 4 years ago

Hello! I'm trying to make predictions with FastForest, but the outputs don't match the XGBoost ones, and I can't figure out why. Here are 10 samples with, respectively, their XGBoost raw prediction, XGBoost prediction after the logistic transformation, FastForest raw prediction, and FastForest prediction after the logistic transformation.

            XGBoost raw  logistic   FastForest raw  logistic
sample 1    4.6465325    0.990      1.39314         0.801
sample 2    4.5409245    0.989      1.39692         0.801
sample 3    4.5436025    0.989      1.7282          0.849
sample 4    4.6465325    0.990      1.70365         0.846
sample 5    3.681776     0.975      0.09692         0.524
sample 6    4.644615     0.990      1.44151         0.808
sample 7    4.6465325    0.990      1.30975         0.787
sample 8    4.6402144    0.990      1.26588         0.780
sample 9    4.6465325    0.990      1.59832         0.831
sample 10   4.2298365    0.985      0.644576        0.655

The figures have been computed as follows:

# Python ('binary:logitraw' model)
raw_output = model.predict_proba(sample)[0][1]
probability = 1 / (1 + math.exp(-raw_output))

// C++ (FastForest)
float raw_output = fastForest(input.data(sample));
float probability = 1. / (1. + std::exp(-(raw_output)));

Do you have any clue? Thanks!

guitargeek commented 4 years ago

Hi @farjasju, sorry for the late answer but unfortunately I don't seem to get email notifications when issues get opened here. I need to do something about that.

From the code you are sharing it seems you are doing everything correctly. Do you have some complete code so I can reproduce the problem? Right now I can only say that you might have done something wrong with the order of the features that you pass to the classifier. Unfortunately, the original order of the features as you pass them to the python model is not stored in the text dump from xgboost, so the fastforest class can't recover it. There is more in the README about this. Did you consider that?
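To rule out ordering typos, the feature-name vector can also be built programmatically; a minimal sketch, assuming the default f0..fN naming that xgboost uses when no explicit feature names are given (`makeFeatureNames` is a hypothetical helper, not part of fastforest):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Build {"f0", "f1", ..., "f<n-1>"} so the ordering is guaranteed
// to be sequential, with no gaps or typos.
std::vector<std::string> makeFeatureNames(std::size_t n) {
    std::vector<std::string> names;
    names.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        names.push_back("f" + std::to_string(i));
    }
    return names;
}
```

With 310 features, the vector from `makeFeatureNames(310)` could then be passed to the FastForest constructor in place of a hand-written list.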

Cheers!

guitargeek commented 4 years ago

Only this looks a bit strange:

float raw_output = fastForest(input.data(sample));

I don't know how you organize your input data, but if input is a vector like in the README, this will not work. Maybe your problem is related to that? Did you check that the input features are the same for the xgboost and fastforest cases?

farjasju commented 4 years ago

Hi and thanks for the answer!

Here is my (simplified) code:

// Initializing FastForest
std::vector<std::string> features{"f0", "f1", "f2", "f3", "f4", "f5", ... "f304", "f305", "f306", "f307", "f308", "f309"};
FastForest fastForest("../xgb_trained_model.txt", features);

// Reading the csv dataset
std::ifstream csv_input("../dataset.csv");
std::string delimiter = ",";
for (std::string line; getline(csv_input, line);)
{
  std::vector<float> input = {}; // the sample vector
  size_t pos = 0;
  std::string token;
  while ((pos = line.find(delimiter)) != std::string::npos)
  {
    token = line.substr(0, pos);
    input.push_back(::atof(token.c_str()));
    line.erase(0, pos + delimiter.length());
  }
  // Predicting the probability
  float output = fastForest(input.data());
  float score = 1. / (1. + std::exp(-(output)));
}
guitargeek commented 4 years ago

What does your csv file look like? Taking a quick glance at your code, it seems to me that the last feature in each row of the CSV is not read, if a row looks the way I would expect (with no trailing comma):

f0,f1,..,f309

I think your code only adds a feature to the input vector while there is a comma after the value, so you would miss the last one. Did you check that input.size() is what you would expect?
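For reference, a tokenizer based on std::getline with a ',' delimiter also emits the field after the last comma; a sketch (`parseRow` is just an illustrative name):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split one CSV row into floats. Unlike a find/erase loop that stops
// at the last comma, std::getline with a ',' delimiter also yields
// the final field, which has no comma after it.
std::vector<float> parseRow(const std::string& line) {
    std::vector<float> values;
    std::stringstream stream(line);
    std::string token;
    while (std::getline(stream, token, ',')) {
        values.push_back(std::stof(token));
    }
    return values;
}
```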

Unfortunately fastforest doesn't check if the number of passed features is correct, because it just takes an array pointer as input. I should change the interface such that the number of passed features is validated.
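Until the interface validates this itself, a caller-side check is easy to add; a hypothetical guard (`checkedData` is not part of the fastforest API):

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical guard: verify the feature count before handing the raw
// pointer to the model. nFeatures would be the size of the
// feature-name vector the FastForest was constructed with.
const float* checkedData(const std::vector<float>& input, std::size_t nFeatures) {
    if (input.size() != nFeatures) {
        throw std::invalid_argument("expected " + std::to_string(nFeatures) +
                                    " features, got " + std::to_string(input.size()));
    }
    return input.data();
}
```

Calling the model as `fastForest(checkedData(input, nFeatures))` would then fail loudly on a miscounted row instead of silently reading garbage.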

farjasju commented 4 years ago

Well, you're totally right, it's missing the last feature. Thank you, I'm not used to working in C++.

Oddly, this doesn't seem to solve the problem, and the outputs are still between 0 and 2, whereas they are around 4 with XGBoost...

guitargeek commented 4 years ago

I'm sorry, in that case I can't spot the problem based on the code you shared. One more thing I can think of concerning the csv file: maybe you exported it with the pandas to_csv() method with the default parameters? In that case, each row would start with an index instead of the first feature, which would give you the wrong result. Other than that, I'm afraid I can't help you further without having the files to reproduce the problem myself. And I would very much like to solve the problem, because an open issue called "XGBoost and FastForest predictions don't match" is not good advertising for my library, even if it is potentially the user's fault :)

farjasju commented 4 years ago

I understand and I apologize for this bad advertising! I uploaded my files to this repo if you want to reproduce the problem. The csv has been generated by myself in a really simple way.

guitargeek commented 4 years ago

Forget what I wrote here in the beginning: it turned out that the original fastforest behaviour was correct, and that the < sign (instead of a <=) in the text dump from xgboost is what is incorrect and was misleading me.

Thanks a lot for that. I see you also have many categorical features in there that only take integer values. That helped me spot a trivial bug in fastforest, which had the wrong behaviour when a feature was equal to the cut value: https://github.com/guitargeek/XGBoost-FastForest/commit/4e9814f838709c63da27c3700ea6012b65639a93#diff-e7d15b6e1cd8e66791b5ad8ca099fc81R211

A new test case was added for this as well.

Independently of this little bug, I can't reproduce the problem that you describe:

Oddly, this doesn't seem to solve the problem and outputs are still between 0 and 2 where they are around 4 with XGBoost...

However, when I run your code from the repository I get values around 4, like you seem to expect (here are the first 10 lines):

label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.989892, primary output:4.584249)
label: 1 - score: 1 (0.989035, primary output:4.502018)
label: 1 - score: 1 (0.989885, primary output:4.583563)
label: 1 - score: 1 (0.987967, primary output:4.408020)
label: 1 - score: 1 (0.985730, primary output:4.235198)
label: 1 - score: 1 (0.989885, primary output:4.583563)

Maybe the problem is just that the rows are ordered differently when you compare to your python output? Would it be possible to also add a binary file of the xgboost model to your repository, such that I can load the model in python and do the comparison? You know, the file that you get as follows:

model = XGBClassifier(n_estimators=100, objective="binary:logitraw").fit(X, y)
model._Booster.save_model("model.bin")

Cheers!

guitargeek commented 4 years ago

Okay, I could reproduce the problem myself by taking your csv file (which actually has 311 columns, so be careful with the feature names, which should go up to f310) and training an xgboost classifier with it. This is now implemented as a unit test: https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/create_test_data.py#L43 https://github.com/guitargeek/XGBoost-FastForest/blob/master/test/test.cpp#L96

I took your file 1000_samples.csv from your repo, compressed it and renamed it to manyfeatures.csv.gz for the unit test, I hope that was ok.

Being able to reproduce the disagreement, it became apparent what the problem was: fastforest stored the index of the feature used at each split in an unsigned char, which is of course very memory-efficient but does not allow for more than 256 features. In this issue, we were dealing with 311 features.

The solution is of course to change it to unsigned int, which makes the unit test based on your problem pass. Please let me know if this works for you so that I can close the issue.
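The arithmetic behind the bug is easy to demonstrate in isolation (this is a standalone illustration, not fastforest's actual code): an unsigned char holds only the values 0-255, so larger feature indices silently wrap modulo 256.

```cpp
#include <cstddef>

// Simulate storing a split's feature index in the old (unsigned char)
// and new (unsigned int) member types.
unsigned char storeNarrow(std::size_t featureIndex) {
    return static_cast<unsigned char>(featureIndex); // wraps modulo 256
}

unsigned int storeWide(std::size_t featureIndex) {
    return static_cast<unsigned int>(featureIndex); // keeps the value
}
```

With the narrow type, an index of 310 comes back as 310 mod 256 = 54, so the tree would quietly cut on feature f54 instead of f310; with unsigned int the index survives intact.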

guitargeek commented 4 years ago

Hi @farjasju, does it work for you now? I would like to close the issue as soon as possible.

guitargeek commented 4 years ago

Issue solved by lifting the 256-feature limit in commit https://github.com/guitargeek/XGBoost-FastForest/commit/3fc24b4731d3660710bc147a9ef32eb1cd509fe7.

@farjasju, feel very free to open another issue if you have further problems!