google / yggdrasil-decision-forests

A library to train, evaluate, interpret, and productionize decision forest models such as Random Forest and Gradient Boosted Decision Trees.
https://ydf.readthedocs.io/
Apache License 2.0

GradientBoostedTreesLearner Model Outputs Inverted Predictions #100

Closed rlcauvin closed 1 week ago

rlcauvin commented 2 weeks ago

I have trained several binary classification models using the GradientBoostedTreesLearner. The value of the target column (label) is either 0 or 1, where 1 is the desired outcome. Typically, the models output the probability that the value will be 1. But one of my models outputs the probability that the value will be 0. When training that model, 1 was the majority class (there were more 1 labels than 0 labels in the training data).

I'm guessing the learner somehow infers the desired outcome in binary classification, and perhaps one of its criteria is which class is the minority class.

I tried retraining the model after converting the label values to boolean True and False values instead (i.e. 1 becomes True and 0 becomes False), and the model now outputs the probability that the value will be 1. So maybe when the learner sees boolean values for the labels, it assumes True represents the desired outcome?
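
Roughly what the retraining looked like, as a simplified sketch with toy data (the real dataset and column names differ):

```python
import pandas as pd
import ydf

# Toy stand-in for my training data; label 1 is the majority class.
train_df = pd.DataFrame({
    "feature": [0.1, 0.9, 0.8, 0.2, 0.7, 0.6] * 20,
    "label":   [0,   1,   1,   0,   1,   1] * 20,
})

# Convert the 0/1 integer labels to booleans before training.
train_df["label"] = train_df["label"].astype(bool)

model = ydf.GradientBoostedTreesLearner(label="label").train(train_df)

# The model now outputs the probability that the label is True (i.e. 1).
print(model.predict(train_df.head()))
```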

Is there a way of specifying the desired outcome label for binary classification problems?

achoum commented 2 weeks ago

Hi Roger,

There is (currently) no way for the user to manually specify the order of the values / columns for classification labels. However, understanding the logic in place should help.

1. As you correctly noted, when the values are boolean, false/true are always mapped to 0/1 respectively.

2. As you correctly noted, in all other cases, values are sorted by decreasing frequency in the training dataset.

This was improved in https://github.com/google/yggdrasil-decision-forests/commit/3aec76ea19dff886dc2e9a6656959bb09a02cd5d . After this commit, if the values are integers, start at zero, and are "dense", they are mapped to themselves (which is what you want). This commit will be included in the June release, which will be published in a few days. In the meantime, you can use booleans :).

3. Another temporary solution is to use the label_classes function. This function returns the ordered list of label classes. You can use it to select the prediction column of interest (see the sketch below).
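
For example, a rough sketch with placeholder data and column names (exact outputs depend on your data and version):

```python
import numpy as np
import pandas as pd
import ydf

# Toy dataset where the label 1 is the majority class.
rng = np.random.default_rng(0)
feature = rng.uniform(size=200)
train_df = pd.DataFrame({
    "feature": feature,
    "label": (feature > 0.3).astype(int),  # ~70% of labels are 1
})

model = ydf.GradientBoostedTreesLearner(label="label").train(train_df)

# Ordered list of label classes as seen by the model. Before the fix,
# non-boolean classes are ordered by decreasing frequency, so with 1 as
# the majority class this prints something like ["1", "0"].
print(model.label_classes())

# The predictions are probabilities; use the class order printed above
# to know which outcome they refer to.
print(model.predict(train_df.head()))
```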

I hope this helps.

achoum commented 1 week ago

Solved in the 0.5.0 release.

TonyCongqianWang commented 1 week ago

I was confused by the same thing, so it's great to have this clarified in this issue. What I am still confused about is that the model was trained for binary classification, yet there seems to be no way to output a binary prediction, or am I missing something? It is nice to get the probability values for post-processing, but I would expect there to be a way to output 0 or 1 predictions directly. What threshold is used for the confusion matrix? 0.5?
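
For now I am post-processing the probabilities myself with a manual threshold, something like this (just a sketch; I don't know if 0.5 is what the library uses internally):

```python
import numpy as np

# Probabilities as returned by model.predict() on some test data (toy values).
probs = np.array([0.12, 0.73, 0.51, 0.08])

# Manual 0.5 threshold to turn the probabilities into 0/1 class predictions.
predicted_classes = (probs >= 0.5).astype(int)
print(predicted_classes)  # [0 1 1 0]
```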