Hi @Timo3as,
Thanks a lot for the deep research and explanation. It seems you are on to something :).
Background
There are two binary hyper-parameters to consider: Hessian split scores or not (default: not), and global optimization or divide-and-conquer (default: divide-and-conquer).
In your example, you are using TF-DF with "Hessian split scores + Global optimization". I think (I might be wrong) that XGBoost uses "Hessian split scores + divide-and-conquer". "divide-and-conquer" is simpler to think about, but this won't change the conclusion.
For TF-DF, those different combinations can be tested as follows in a colab:
```python
import tensorflow_decision_forests as tfdf
import pandas as pd

dataset = pd.DataFrame({"Feature_a": [0, 0, 1, 0], "Feature_b": [1, 0, 1, 1], "label": [0, 0, 1, 0]})
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="label")

model = tfdf.keras.GradientBoostedTreesModel(
    validation_ratio=0.0,
    num_trees=1,
    min_examples=1,
    # growing_strategy="BEST_FIRST_GLOBAL",
    # use_hessian_gain=True,
)
model.fit(tf_dataset)

tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=10)
```
Results:
- #growing_strategy="BEST_FIRST_GLOBAL", #use_hessian_gain=True -> 1 split
- growing_strategy="BEST_FIRST_GLOBAL", #use_hessian_gain=True -> 1 split
- #growing_strategy="BEST_FIRST_GLOBAL", use_hessian_gain=True -> 2 splits
- growing_strategy="BEST_FIRST_GLOBAL", use_hessian_gain=True -> 2 splits
About the scores:
When not using Hessian split scores, the score (for a binary classification like in your example) is the reduction of variance:
$$ \text{score} = \text{variance}(\text{parent}) - \sum_{c \in \text{children}} \text{variance}(c) $$
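For example, on the toy dataset above (assuming unweighted child variances, as in the formula), splitting on Feature_a gives:

$$ \text{variance}(\text{parent}) = \tfrac{1}{4}\left(3 \cdot (0 - 0.25)^2 + (1 - 0.25)^2\right) = 0.1875 $$

$$ \text{score} = 0.1875 - \left(\text{variance}(\{0,0,0\}) + \text{variance}(\{1\})\right) = 0.1875 - 0 = 0.1875 $$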
When using Hessian split scores, the score (for a binary classification like in your example) is, informally, the sum of the Newton step values over the children (not taking regularization into account):
$$ \text{score} \approx \sum_{c \in \text{children}} \frac{\text{sum\_grad}_c^2}{\text{sum\_hessian}_c} \qquad (1) $$
Using the same notation, in XGBoost the score is (after reading the paper again):
$$ \text{score} \approx \left(\sum_{c \in \text{children}} \frac{\text{sum\_grad}_c^2}{\text{sum\_hessian}_c}\right) - \frac{\text{sum\_grad}_{\text{parent}}^2}{\text{sum\_hessian}_{\text{parent}}} \qquad (2) $$
In addition, in TF-DF, splits with negative scores are ignored.
In your example
If using divide-and-conquer, the "parent term" is the same for all candidate splits of a node. Therefore, the best split is the same in both versions (1) and (2). However, the detection of invalid splits differs: a split might be valid for (1) and invalid for (2). In fact, a split can never be invalid in (1), at least not from the score alone, since (1) is always non-negative. I think this difference is visible on small datasets, but it is less likely to matter on large datasets (to be confirmed). The sketch below makes the difference concrete.
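Here is a small sketch (not the actual TF-DF or XGBoost code; the gradient/Hessian sums are made up for illustration) computing both scores from per-child gradient and Hessian sums:

```python
# Sketch only: illustrates score formulations (1) and (2) above.
# The gradient/Hessian sums below are made-up numbers, not values
# computed by TF-DF or XGBoost.

def score_v1(children):
    """Formulation (1): sum of the children terms; always >= 0."""
    return sum(g * g / h for g, h in children)

def score_v2(children, parent):
    """Formulation (2): same sum, minus the parent term; can be negative."""
    g_p, h_p = parent
    return score_v1(children) - g_p * g_p / h_p

# (sum_grad, sum_hessian) for the two children and the parent.
left, right = (1.0, 2.0), (-0.5, 1.0)
parent = (0.5, 3.0)

print(score_v1([left, right]))          # 0.75 -> never flagged as invalid
print(score_v2([left, right], parent))  # ~0.67 here, but can go negative
# When the children barely differ from the parent, (2) drops to ~0 or
# below and the split is rejected, while (1) stays positive.
```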
For global optimization, this is more complex.
Thinking about it, version (2) makes more sense. I added a hyper-parameter to test it. The results are as follows:
Some other tests
Replacing score formulation (1) with (2):
In your example:
On the Adult dataset, 10-fold CV, with various combinations of hyper-parameters:
10-fold CV on 70 small datasets, with various combinations of hyper-parameters:
Conclusion
Formulation (2) makes more sense, and I wonder whether it should become the new default or just a new hyper-parameter. I'll have to run some more experiments to figure it out (why the drop in AUC, whether the speed increase is significant, other impacts).
In the meantime, I've added a new advanced hyper-parameter for it, hessian_split_score_subtract_parent, and it will be available in the next release.
I have to thank you again. Your exploratory work was awesome :).
The hessian_split_score_subtract_parent hyper-parameter and the rejection of null splits were released :).
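For anyone who wants to try it, something along these lines should work in a colab (the exact way the hyper-parameter is exposed is an assumption; check the hyper-parameter list of your TF-DF version):

```python
import tensorflow_decision_forests as tfdf

# Sketch: Hessian scores with the parent term subtracted, as in
# formulation (2). The keyword exposure below is assumed to follow the
# other advanced hyper-parameters; verify against your TF-DF version.
model = tfdf.keras.GradientBoostedTreesModel(
    use_hessian_gain=True,
    hessian_split_score_subtract_parent=True,
)
```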
Follow-up question to the last issue.
Hi again. We looked further into this matter, comparing the implemented algorithms, and I think we found a difference. At this point we are unsure whether this difference is significant, but we wanted to share it in case the information is useful.
While comparing, we looked at the trees that the algorithms produced and noticed that the Yggdrasil trees were getting much larger than the XGBoost ones. To find out why, we compared how the two algorithms grow their trees, in particular how they decide whether a split will happen. We think we found the reason why the trees grow larger.
To get a better view of the internals, we used this small dataset:
Splits
XGBoost split
Split 1
Best split candidate:
Split 2
Best split candidate:
Yggdrasil split
Split 1
Best split candidate:
Split 2
Best split candidate:
Explanation
Unlike XGBoost, Yggdrasil does not seem to carry the gain of the previous split forward; it appears to be reset to 0 before each new split is evaluated. We think this is why Yggdrasil trees grow larger than XGBoost trees: Yggdrasil resets the score to 0 on every iteration, while XGBoost only accepts splits that score better than the one of the parent node. The NodeCondition is initialized every time the split function is called, and a new NodeCondition defaults to a score of 0. In this small example this is not an issue, but it may or may not become one for more complex problems. The sketch below illustrates the two acceptance rules as we understand them.
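This is a rough illustration of what we mean (a simplified sketch, not the actual Yggdrasil or XGBoost code; the candidate scores are made up):

```python
# Simplified sketch of the two split-acceptance rules as we understand
# them; not actual Yggdrasil or XGBoost code, and the scores are made up.

def accept_split_yggdrasil_like(candidate_scores):
    """The best candidate is compared against a fresh threshold of 0."""
    best = max(candidate_scores, default=None)
    return best is not None and best > 0.0

def accept_split_xgboost_like(candidate_scores, parent_score):
    """The best candidate must improve on the parent's carried-over score."""
    best = max(candidate_scores, default=None)
    return best is not None and best > parent_score

scores = [0.05, 0.12, 0.08]                       # made-up candidate scores
print(accept_split_yggdrasil_like(scores))        # True: 0.12 > 0
print(accept_split_xgboost_like(scores, 0.30))    # False: 0.12 < 0.30
# The first rule keeps splitting as long as any candidate has a positive
# score, which is consistent with the larger trees we observed.
```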
Comparison
We tried to find a difference in performance on different datasets. So far these have been small toy datasets.
Setup
XGBoost was run on a macOS laptop.
TF-DF was run on Google Colab and Multipass.
Multipass:
Datasets
Sklearn - Breast Cancer Dataset
Sklearn - Digit Dataset
Both datasets were split with sklearn's train_test_split, using test_size=0.2 and random_state=42, as in the snippet below.
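For reference, the loading and splitting looked roughly like this (a sketch; the variable names are ours):

```python
# Sketch of the loading and splitting we used (variable names are ours).
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.model_selection import train_test_split

for load in (load_breast_cancer, load_digits):
    X, y = load(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
```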
Comparison
The datasets were loaded with sklearn and evaluated with different configurations of both models. We first tried to make the configurations as equal as possible.
Parameter mapping between Yggdrasil and XGBoost:
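Roughly, the matching was of this form (a sketch; the concrete values below are placeholders, not the exact settings from our runs):

```python
# Sketch of how we aligned the hyper-parameters of the two libraries.
# The concrete values are placeholders, not the exact settings we used.
import tensorflow_decision_forests as tfdf
import xgboost as xgb

ygg_model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=100,   # ~ XGBoost n_estimators
    max_depth=6,     # same meaning in both libraries
    shrinkage=0.1,   # ~ XGBoost learning_rate
)
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
)
```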
Conclusion
So far we are not sure whether the algorithms' different approaches make a big difference in model performance. After tweaking the hyper-parameters, Yggdrasil was able to perform as well as, or even better than, XGBoost. So I wanted to reach out to you to see whether you can confirm or refute our observation and give further insight.
Thanks a lot and best regards,
Timo