[Discussion] accuracy improvements

guolinke commented 4 years ago

This is to call the accuracy improvement for the tree-based algorithms, include but not limited to:

Better regularization methods that improve generalization abilities.
Better base tree learner, like piece-wise linear trees, to improve the effectiveness of base learner
Better objective functions for special tasks/metrics

If you have any ideas, you can discuss with us here, and open the corresponding issues/pull requests if needed.

MaxHalford commented 4 years ago

There's an idea I've encoutered in the litterature that seems promising, it's called path smoothing.

Trees overfit because they are are too deep and thus their leaves have too few observations in them. When a leaf has too few observations, then the distribution of the target in said leaf is biased and unreliable. The common way to solve is to restrict the maximal depth of the tree, or to apply pruning.

Another less common way is to do path smoothing: calculate a weighted average with respect to each node from root to the leaf for a given observation. Each node thus contributes to the weighted average. There are different weighting schemes that are possible, they just have to capture the fact that the contribution should increase with the number of observations and the depth. What's nice is that this way, even if a leaf has few observations, it's parents will correct the target distribution by smoothing it out. You can think of it as a Bayesian or hierarchical prior.

btrotta commented 4 years ago

@MaxHalford sounds interesting, do you have a reference for this?

MaxHalford commented 4 years ago

Here are some papers which I have managed to find:

Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers: in subsection 2.3 they introduce curtailment, which involves using the first leaf along the path which contains at least k observations. It's therefore very similar to pruning and isn't worth exploring. It's of use to me because I'm researching decision trees for online learning, and this simple heuristic helps a ton.
Learning Classification Trees discusses path smoothing (they call it something else) but the details weren't very clear to me.
Mondrian forests: Efficient online random forests: see section 4.
Decision Trees for Ranking: Effect of newsmoothing methods, new splitting criteria and simple pruning methods: this seems of interest but I haven't read it.

On a side-note, I've been working on a scheme of my own that's giving me very good reasonable and is easy to understand. What I do is that make each node along contribute with a weight which is n_samples * (1 + beta) ** depth, where beta is a positive number. Intuitively, the contribution of a node increases with the number of samples and the depth. The beta parameter controls by how much the depth matters. If a leaf contains no samples, then it's contribution to the weighted average is 0. In that case the other nodes higher up in the tree will be used to estimate the target value.

When I use this scheme, I can virtually increase the maximum depth by how much I want: I never overfit. The path averaging acts a regularizer.

zkurtz commented 4 years ago

Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossibly to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ingnoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM.

UPDATE: This looks promising and already builds on LightGBM.

wmiller01 commented 4 years ago

I think the concept of a "soft" decision tree is worth implementing. Soft decision trees learn using recursive partitioning, but instead of generating hard split points, each node probabilistically assigns its children to the left or right path. As a result, soft trees learn smooth/continuous functions of the data as opposed to jagged ones.

The problem this solves is that ordinary decision trees with hard splits generate most of their splits in the dense areas of the input space. When an observation from a sparser part of the space is fed into the model, it typically falls into a very wide area which has a constant prediction. Probabilistic splitting resulting in a continuous function would not have a constant prediction in sparse space, and therefore is likely to generalize better. It seems from the literature that soft trees also seem to generate fewer splits to achieve comparable levels of accuracy, leading to more efficient models.

Here are some resourcesI've found on this concept. https://arxiv.org/abs/1711.09784 https://www.cmpe.boun.edu.tr/~ethem/files/papers/icpr2012_softtree.pdf http://www.cs.cornell.edu/~oirsoy/softtree.html https://towardsdatascience.com/building-a-decision-tree-in-tensorflow-742438cb483e

xuyxu commented 4 years ago

Hi, @wmiller01, I have been playing with soft decision trees (SDT) for a while, and I don't think it is a good idea to include it in LightGBM or any other scalable GBDT system. Pros:

SDT is more powerful than traditional trees like CART. For example, applying 10 SDTs onto the vanilla Gradient Boosting algorithm leads to better performance than LightGBM / XGBoost with hundreds of base decision trees (n_estimators * n_classes).

Cons:

SDT looks more like a neural network with the tree structure. Its training and inference costs are quite large. For example, the number of leaf nodes in SDT increases exponentially with the tree depth
Soft splitting in SDT (e.g., via logistic regression) leads to huge maintenance costs, as it is totally different from the splitting criterion in GBDT using grads and hessians.

guolinke commented 4 years ago

For the SDT, I agree with @AaronX121 . Besides of efficiency, I think the learning of SDT is also very different with DT. The DT is learned by greedy feature selection, while SDT is more like learned by the back-propagation as in neural networks. The greedy feature selection is a great strength of GBDT. With this function, GBDT is very robust to all kinds of tabular data. But replacing it to SDT, it may not outperform original GBDT, just like the NN cannot outperform GBDT in many tabular data.

deadsoul44 commented 3 years ago

...., just like the NN cannot outperform GBDT in many tabular data.

Out of topic but is there any reference to show that NN cannot outperform GBDT in many tabular data? Thank you.

onacrame commented 3 years ago

Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossibly to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ingnoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM.

UPDATE: This looks promising and already builds on LightGBM.

This looks like an excellent package and would be good to see similar functionality within LightGBM itself. This looks like it’s capable of dealing with spatial data as well as panel data. Panel data has always been one of the weaknesses of trees due to within group correlation and this seems to deal with that explicitly

shiyu1994 commented 3 years ago

@onacrame Added that to the feature requests & voting hub (#2302)

microsoft / LightGBM

[Discussion] accuracy improvements #2790