Open · guolinke opened this issue 4 years ago
There's an idea I've encountered in the literature that seems promising; it's called path smoothing.
Trees overfit because they are too deep and thus their leaves have too few observations in them. When a leaf has too few observations, the distribution of the target in said leaf is biased and unreliable. The common way to solve this is to restrict the maximal depth of the tree, or to apply pruning.
Another, less common way is path smoothing: for a given observation, compute a weighted average over the nodes on its path from the root to the leaf, so that each node contributes to the prediction. Different weighting schemes are possible; they just have to capture the fact that a node's contribution should increase with its number of observations and its depth. What's nice is that this way, even if a leaf has few observations, its parents will correct the target distribution by smoothing it out. You can think of it as a Bayesian or hierarchical prior.
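Here is a minimal sketch of the idea in Python (the `Node` layout, `path_smoothed_prediction`, and the pluggable `weight` function are illustrative assumptions of mine, nothing from LightGBM): the prediction for an observation is a weighted average of the target means stored in the nodes along its root-to-leaf path.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Node:
    mean: float     # mean target value of the training samples in this node
    n_samples: int  # number of training samples in this node
    depth: int      # 0 for the root

def path_smoothed_prediction(path: List[Node],
                             weight: Callable[[Node], float]) -> float:
    """Weighted average of the node means along a root-to-leaf path."""
    weights = [weight(node) for node in path]
    total = sum(weights)
    if total == 0:
        return path[0].mean  # degenerate case: fall back to the root estimate
    return sum(w * node.mean for w, node in zip(weights, path)) / total
```

Any weighting that increases with sample count and depth fits this template, so a sparsely populated leaf gets pulled towards its ancestors' estimates.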
@MaxHalford sounds interesting, do you have a reference for this?
Here are some papers which I have managed to find:
… `k` observations. It's therefore very similar to pruning and isn't worth exploring. It's of use to me because I'm researching decision trees for online learning, and this simple heuristic helps a ton.

On a side note, I've been working on a scheme of my own that's giving me very good results and is easy to understand. I make each node along the path contribute with a weight of `n_samples * (1 + beta) ** depth`, where `beta` is a positive number. Intuitively, the contribution of a node increases with the number of samples and the depth. The `beta` parameter controls how much the depth matters. If a leaf contains no samples, then its contribution to the weighted average is 0; in that case the other nodes higher up in the tree are used to estimate the target value.
When I use this scheme, I can increase the maximum depth as much as I want and I never overfit: the path averaging acts as a regularizer.
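To make that concrete, here is a toy sketch of this weighting scheme; `node_weight` and the numbers in the example path are made up for illustration.

```python
def node_weight(n_samples: int, depth: int, beta: float = 0.5) -> float:
    # A node's contribution grows with its sample count and, via beta, its depth.
    # An empty leaf (n_samples == 0) contributes nothing, so ancestors take over.
    return n_samples * (1 + beta) ** depth

# Toy root-to-leaf path as (n_samples, depth, mean target) triples.
path = [(1000, 0, 0.10), (40, 1, 0.30), (3, 2, 0.90), (0, 3, 0.0)]
weights = [node_weight(n, d) for n, d, _ in path]
prediction = sum(w * m for w, (_, _, m) in zip(weights, path)) / sum(weights)
print(prediction)  # ~0.116: the 3-sample leaf's value of 0.90 is shrunk towards its ancestors
```

A larger `beta` lets deeper nodes dominate, moving the behaviour back towards an ordinary leaf prediction, while `beta = 0` weights nodes by sample count alone.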
Hierarchically clustered observations (such as data that involve repeated measurements of individual subjects) present a challenge for GBMs. The worst case is when all other predictors are constant within-subject. Including the subject ID as a predictor leads to all other predictors being ignored, making it impossible to predict on a new subject. Even if you exclude the subject ID from the prediction features, the GBM can overfit on the other features, identifying subjects implicitly with deep trees, effectively overfitting to individuals and ignoring the population-level correlations that are needed to generalize to new subjects. Here's a very nice discussion of the issue with links to numerous recent proposals that could be adapted for LightGBM.
UPDATE: This looks promising and already builds on LightGBM.
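To make the evaluation side of this concrete, here is a rough sketch (synthetic data and made-up variable names, not part of LightGBM or of the linked proposals): splitting by subject with GroupKFold so the model is scored on subjects it has never seen, which is the kind of generalization that matters here.

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_subjects, obs_per_subject = 50, 20
subject = np.repeat(np.arange(n_subjects), obs_per_subject)  # subject ID per row
x = rng.normal(size=subject.size)
# Target mixes a population-level effect of x with a per-subject random offset.
subject_offset = rng.normal(scale=2.0, size=n_subjects)[subject]
y = 1.5 * x + subject_offset + rng.normal(scale=0.5, size=subject.size)

X = x.reshape(-1, 1)  # the subject ID is deliberately excluded from the features
mse = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups=subject):
    model = lgb.LGBMRegressor(max_depth=3, n_estimators=200)
    model.fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    mse.append(np.mean((pred - y[test_idx]) ** 2))
print("MSE on held-out subjects:", np.mean(mse))
```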
I think the concept of a "soft" decision tree is worth implementing. Soft decision trees learn using recursive partitioning, but instead of generating hard split points, each node probabilistically routes an observation to its left and right children. As a result, soft trees learn smooth/continuous functions of the data as opposed to jagged ones.
The problem this solves is that ordinary decision trees with hard splits generate most of their splits in the dense areas of the input space. When an observation from a sparser part of the space is fed into the model, it typically falls into a very wide region with a constant prediction. Probabilistic splitting produces a continuous function that does not have a constant prediction in sparse regions, and is therefore likely to generalize better. The literature also suggests that soft trees need fewer splits to achieve comparable accuracy, leading to more efficient models.
Here are some resources I've found on this concept:
https://arxiv.org/abs/1711.09784
https://www.cmpe.boun.edu.tr/~ethem/files/papers/icpr2012_softtree.pdf
http://www.cs.cornell.edu/~oirsoy/softtree.html
https://towardsdatascience.com/building-a-decision-tree-in-tensorflow-742438cb483e
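To illustrate the core difference, here is a toy sketch (not taken from any of the resources above; `hard_split`, `soft_split`, and `steepness` are my own illustrative names): a hard split is piecewise constant, while a soft split routes an observation to both children with sigmoid probabilities, so the prediction varies smoothly with the input.

```python
import math

def hard_split(x, threshold, left_value, right_value):
    # Ordinary tree node: piecewise constant, with a jump at the threshold.
    return left_value if x <= threshold else right_value

def soft_split(x, threshold, left_value, right_value, steepness=2.0):
    # Soft node: route left with probability p_left and right with 1 - p_left,
    # so the prediction is a smooth function of x instead of a step.
    p_left = 1.0 / (1.0 + math.exp(steepness * (x - threshold)))
    return p_left * left_value + (1.0 - p_left) * right_value

for x in (-2.0, -0.1, 0.1, 2.0):
    print(x, hard_split(x, 0.0, 1.0, 5.0), round(soft_split(x, 0.0, 1.0, 5.0), 3))
```

In a full soft tree these routing probabilities are multiplied along each path and the leaf values are averaged accordingly, which is what allows gradient-based training.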
Hi, @wmiller01, I have been playing with soft decision trees (SDT) for a while, and I don't think it is a good idea to include it in LightGBM or any other scalable GBDT system. Pros:
Cons:
For the SDT, I agree with @AaronX121. Besides efficiency, I think the learning of an SDT is also very different from that of a DT. A DT is learned by greedy feature selection, while an SDT is learned more like a neural network, by back-propagation. Greedy feature selection is a great strength of GBDT; it is what makes GBDT very robust to all kinds of tabular data. Replacing it with an SDT may not outperform the original GBDT, just like NNs cannot outperform GBDT on many tabular datasets.
..., just like NNs cannot outperform GBDT on many tabular datasets.
Off topic, but is there any reference showing that NNs cannot outperform GBDT on many tabular datasets? Thank you.
UPDATE: This looks promising and already builds on LightGBM.
This looks like an excellent package and it would be good to see similar functionality within LightGBM itself. It looks like it's capable of dealing with spatial data as well as panel data. Panel data has always been one of the weaknesses of trees due to within-group correlation, and this seems to deal with that explicitly.
@onacrame Added that to the feature requests & voting hub (#2302)
This is a call for accuracy improvements for tree-based algorithms, including but not limited to:
If you have any ideas, you can discuss them with us here, and open the corresponding issues/pull requests if needed.