Hi @Timo3as,
Thanks a lot for the deep research and explanation. It seems you are on to something :).
Background
There are two binary hyper-parameters to consider: Hessian split scores or not (default: not), and global optimization or divide-and-conquer (default: divide-and-conquer).
In your example, you are using TF-DF with "Hessian split scores + Global optimization". I think (I might be wrong) that XGBoost uses "Hessian split scores + divide-and-conquer". "divide-and-conquer" is simpler to think about, but this won't change the conclusion.
For TF-DF, those different combinations can be tested as follows in a colab:
```python
import tensorflow_decision_forests as tfdf
import pandas as pd

dataset = pd.DataFrame({"Feature_a": [0, 0, 1, 0], "Feature_b": [1, 0, 1, 1], "label": [0, 0, 1, 0]})
tf_dataset = tfdf.keras.pd_dataframe_to_tf_dataset(dataset, label="label")

model = tfdf.keras.GradientBoostedTreesModel(
    validation_ratio=0.0,
    num_trees=1,
    min_examples=1,
    # growing_strategy="BEST_FIRST_GLOBAL",
    # use_hessian_gain=True,
)
model.fit(tf_dataset)

tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=10)
```
Results:
- #growing_strategy="BEST_FIRST_GLOBAL", #use_hessian_gain=True -> 1 split
- growing_strategy="BEST_FIRST_GLOBAL", #use_hessian_gain=True -> 1 split
- #growing_strategy="BEST_FIRST_GLOBAL", use_hessian_gain=True -> 2 splits
- growing_strategy="BEST_FIRST_GLOBAL", use_hessian_gain=True -> 2 splits
About the scores:
When not using Hessian split scores, the score (for a binary classification like in your example) is the reduction of variance:
$$ \text{score} = \text{variance}(\text{parent}) - \sum_{c \in \text{children}} \text{variance}(c) $$
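For example, on the toy dataset above (assuming unweighted child variances, as in the formula), splitting on Feature_a gives:

$$ \text{variance}(\text{parent}) = \tfrac{1}{4}\left(3 \cdot (0 - 0.25)^2 + (1 - 0.25)^2\right) = 0.1875 $$

$$ \text{score} = 0.1875 - \left(\text{variance}(\{0,0,0\}) + \text{variance}(\{1\})\right) = 0.1875 - 0 = 0.1875 $$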
When using Hessian split scores, the score (for a binary classification like in your example) is, informally, the sum of the Newton step values over the children (not taking regularization into account):
$$ \text{score} \approx \sum_{c \in \text{children}} \frac{\text{sum\_grad}_c^2}{\text{sum\_hessian}_c} \qquad (1) $$
Using the same notation, in XGBoost the score is (after reading the paper again):
$$ \text{score} \approx \left(\sum_{c \in \text{children}} \frac{\text{sum\_grad}_c^2}{\text{sum\_hessian}_c}\right) - \frac{\text{sum\_grad}_{\text{parent}}^2}{\text{sum\_hessian}_{\text{parent}}} \qquad (2) $$
In addition, in TF-DF, splits with negative scores are ignored.
In your example
If using divide-and-conquer, the "parent term" is the same for all candidate splits of a node. Therefore, the best split is the same in both versions (1) and (2). However, the detection of invalid splits differs: a split might be valid for (1) and invalid for (2). In fact, a split can never be invalid in (1), at least not from the score alone, since (1) is always non-negative. I think this difference is visible on small datasets, but it is less likely to matter on large datasets (to be confirmed). The sketch below makes the difference concrete.
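Here is a small sketch (not the actual TF-DF or XGBoost code; the gradient/Hessian sums are made up for illustration) computing both scores from per-child gradient and Hessian sums:

```python
# Sketch only: illustrates score formulations (1) and (2) above.
# The gradient/Hessian sums below are made-up numbers, not values
# computed by TF-DF or XGBoost.

def score_v1(children):
    """Formulation (1): sum of the children terms; always >= 0."""
    return sum(g * g / h for g, h in children)

def score_v2(children, parent):
    """Formulation (2): same sum, minus the parent term; can be negative."""
    g_p, h_p = parent
    return score_v1(children) - g_p * g_p / h_p

# (sum_grad, sum_hessian) for the two children and the parent.
left, right = (1.0, 2.0), (-0.5, 1.0)
parent = (0.5, 3.0)

print(score_v1([left, right]))          # 0.75 -> never flagged as invalid
print(score_v2([left, right], parent))  # ~0.67 here, but can go negative
# When the children barely differ from the parent, (2) drops to ~0 or
# below and the split is rejected, while (1) stays positive.
```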
For global optimization, this is more complex.
Thinking about it, version (2) makes more sense. I added a hyper-parameter to test it. The results are as follows:
Some other tests
Replacing score formulation (1) with (2):
In your example:
On the Adult dataset, 10-fold CV, with various combinations of hyper-parameters:
10-fold CV on 70 small datasets, with various combinations of hyper-parameters:
Conclusion
Formulation (2) makes more sense, and I wonder whether it should become the new default or just a new hyper-parameter. I'll have to run some more experiments to figure it out (why the drop in AUC, whether the speed increase is significant, other impacts).
In the meantime, I've added a new advanced hyper-parameter for it, hessian_split_score_subtract_parent, and it will be available in the next release.
I have to thank you again. Your exploratory work was awesome :).
The hessian_split_score_subtract_parent hyper-parameter and the rejection of null splits were released :).
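For anyone who wants to try it, something along these lines should work in a colab (the exact way the hyper-parameter is exposed is an assumption; check the hyper-parameter list of your TF-DF version):

```python
import tensorflow_decision_forests as tfdf

# Sketch: Hessian scores with the parent term subtracted, as in
# formulation (2). The keyword exposure below is assumed to follow the
# other advanced hyper-parameters; verify against your TF-DF version.
model = tfdf.keras.GradientBoostedTreesModel(
    use_hessian_gain=True,
    hessian_split_score_subtract_parent=True,
)
```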
Follow-up question to the last issue.
Hi again. We looked further into this matter, comparing the implemented algorithms, and I think we found a difference. At this point we are unsure whether this difference is significant, but we wanted to share it in case the information is useful.
While comparing, we looked at the trees that the algorithms produced and noticed that the Yggdrasil trees were getting much larger than the XGBoost ones. To find out why, we compared how the two algorithms grow their trees, in particular how they decide whether a split will happen. We think we found the reason why the trees grow larger.
To get a better view of the internals, we used this small dataset:
Splits
XGBoost split
Split 1
Best split candidate:
Split 2
Best split candidate:
Yggdrasil split
Split 1
Best split candidate:
Split 2
Best split candidate:
Explanation
Unlike XGBoost, Yggdrasil does not seem to carry the gain of the previous split forward; it appears to be reset to 0 before each new split is evaluated. We think this is why Yggdrasil trees grow larger than XGBoost trees: Yggdrasil resets the score to 0 on every iteration, while XGBoost only accepts splits that score better than the one of the parent node. The NodeCondition is initialized every time the split function is called, and a new NodeCondition defaults to a score of 0. In this small example this is not an issue, but it may or may not become one for more complex problems. The sketch below illustrates the two acceptance rules as we understand them.
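This is a rough illustration of what we mean (a simplified sketch, not the actual Yggdrasil or XGBoost code; the candidate scores are made up):

```python
# Simplified sketch of the two split-acceptance rules as we understand
# them; not actual Yggdrasil or XGBoost code, and the scores are made up.

def accept_split_yggdrasil_like(candidate_scores):
    """The best candidate is compared against a fresh threshold of 0."""
    best = max(candidate_scores, default=None)
    return best is not None and best > 0.0

def accept_split_xgboost_like(candidate_scores, parent_score):
    """The best candidate must improve on the parent's carried-over score."""
    best = max(candidate_scores, default=None)
    return best is not None and best > parent_score

scores = [0.05, 0.12, 0.08]                       # made-up candidate scores
print(accept_split_yggdrasil_like(scores))        # True: 0.12 > 0
print(accept_split_xgboost_like(scores, 0.30))    # False: 0.12 < 0.30
# The first rule keeps splitting as long as any candidate has a positive
# score, which is consistent with the larger trees we observed.
```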
Comparison
We tried to find a difference in performance on different datasets. So far these have been small toy datasets.
Setup
XGBoost was run on a macOS laptop.
TF-DF was run on Google Colab and Multipass.
Multipass:
Datasets
Sklearn - Breast Cancer Dataset
Sklearn - Digit Dataset
Both datasets were split with sklearn's train_test_split, using test_size=0.2 and random_state=42, as in the snippet below.
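For reference, the loading and splitting looked roughly like this (a sketch; the variable names are ours):

```python
# Sketch of the loading and splitting we used (variable names are ours).
from sklearn.datasets import load_breast_cancer, load_digits
from sklearn.model_selection import train_test_split

for load in (load_breast_cancer, load_digits):
    X, y = load(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
```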
Comparison
The datasets were loaded with sklearn and evaluated with different configurations of both models. We first tried to make the configurations as equal as possible.
Parameter mapping between Yggdrasil and XGBoost:
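Roughly, the matching was of this form (a sketch; the concrete values below are placeholders, not the exact settings from our runs):

```python
# Sketch of how we aligned the hyper-parameters of the two libraries.
# The concrete values are placeholders, not the exact settings we used.
import tensorflow_decision_forests as tfdf
import xgboost as xgb

ygg_model = tfdf.keras.GradientBoostedTreesModel(
    num_trees=100,   # ~ XGBoost n_estimators
    max_depth=6,     # same meaning in both libraries
    shrinkage=0.1,   # ~ XGBoost learning_rate
)
xgb_model = xgb.XGBClassifier(
    n_estimators=100,
    max_depth=6,
    learning_rate=0.1,
)
```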
Conclusion
So far we are not sure whether the algorithms' different approaches make a big difference in model performance. After tweaking the hyper-parameters, Yggdrasil was able to perform as well as, or even better than, XGBoost. So I wanted to reach out to you to see whether you can confirm or refute our observation and give further insight.
Thanks a lot and best regards,
Timo