Closed brunofacca closed 3 years ago

Hello. Thank you for this great library.

The "Deal with Over-fitting" section of the Parameters Tuning docs says "Use `min_data_in_leaf` and `min_sum_hessian_in_leaf`".

1. Don't those two parameters have the same purpose, except that the former is based only on the number of samples and the latter also takes sample weights into account? That seems to be confirmed by this comment on issue #2258 (which says "you can use `min_sum_hessian_in_leaf` to replace the `min_data_in_leaf`, for sample weights.") and issue #2870.
2. Does it make sense to use both parameters together or to use only one of them?
3. If it doesn't make sense to use both together, do you have any recommendations about how to choose which one to use (e.g., `min_data_in_leaf` if not using sample weights or `min_sum_hessian_in_leaf` if using sample weights)?
4. If all samples have weight 1.0, does setting `min_data_in_leaf=10` have the exact same effect as setting `min_sum_hessian_in_leaf=10`, or do they have different scales? I ask because their default values are very different (20 vs. 1e-3).

Thank you.

Hi. I assumed it would be appropriate to ask the above questions here because it could lead to an improvement in the documentation. If you disagree, please let me know if there is a better place to ask (e.g., a mailing list). Thank you.
Yep, this is the right place! This project is maintained by a small group with limited time, so responses can sometimes be slow.
@shiyu1994 could you help with this?
It's impressive what you've achieved as a small group. Given how powerful and feature-rich LightGBM is, I would have thought there was a large team (probably multiple teams) behind it.
Sorry for nagging you. Just a friendly reminder in case this issue got lost.
@brunofacca Thanks for using LightGBM.
First, we should know what `min_sum_hessian_in_leaf` refers to. It is the sum of the hessians of the data points in a leaf, where the hessian of a data point is the second-order derivative of the loss function w.r.t. the current prediction value.
Here's an example of how the hessian is calculated. Suppose we are training a GBDT model for a binary classification task, and the target is to minimize the binary cross-entropy

$$L = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

where $p_i = \frac{1}{1 + e^{-\hat{y}_i}}$ and $\hat{y}_i$ is the prediction value (raw score before sigmoid transformation) of LightGBM.

Now we use $\hat{y}_i^m$ for the prediction value of data point $i$ after $m$ iterations of boosting, and $\hat{y}_i^{m+1} = \hat{y}_i^m + f_{m+1}(x_i)$ for all $m \geq 0$, where $f_{m+1}$ is the tree to be trained in iteration $m+1$.

Let $p_i^m = \frac{1}{1 + e^{-\hat{y}_i^m}}$. Then the gradient (actually, the opposite number of the gradient) and the hessian of data point $i$ in iteration $m+1$ are calculated in the following way:

$$g_i^{m+1} = -\frac{\partial l_i}{\partial \hat{y}_i^m} = y_i - p_i^m, \qquad h_i^{m+1} = \frac{\partial^2 l_i}{\partial \left( \hat{y}_i^m \right)^2} = p_i^m \left( 1 - p_i^m \right)$$

where $l_i$ is the loss of data point $i$, i.e. $l_i = -\left[ y_i \log p_i^m + (1 - y_i) \log (1 - p_i^m) \right]$.
Since $h_i^{m+1}$ is always positive, the sum of $h_i^{m+1}$ over all data points in a leaf can sometimes reflect the number of data points in that leaf. But as the boosting goes on, different data points get different $h_i^{m+1}$ values. With binary cross-entropy as the loss, a small $h_i^{m+1}$ means the model has relatively high confidence in its prediction for that data point, since $p_i^m$ is quite close to either $0$ or $1$. Data points whose raw prediction scores $\hat{y}_i^m$ are close to $0$ (which means they are close to the classification boundary of the current model) have relatively high $h_i^{m+1}$ values.
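To make that concrete, here is a small numerical sketch (plain NumPy, not LightGBM internals; the raw scores are arbitrary) of how the hessian $p_i^m (1 - p_i^m)$ behaves:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw scores (the \hat{y}_i^m above), from confidently negative to confidently positive.
raw_scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(raw_scores)
hess = p * (1.0 - p)  # h_i^{m+1} = p_i^m * (1 - p_i^m)

for score, h in zip(raw_scores, hess):
    print(f"raw score {score:+.1f} -> hessian {h:.4f}")
# The hessian peaks at 0.25 for a raw score of 0 (near the decision boundary)
# and approaches 0 as the model grows confident, which is why per-leaf
# hessian sums shrink in later boosting iterations.
```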
Based on the above background, here are the answers to your questions.

1. `min_data_in_leaf` does not consider sample weights, but `min_sum_hessian_in_leaf` does, since the hessian values are multiplied by the sample weights before being fed to the tree learner. (Note that sample weights and hessian values are two different concepts!)
2. It can make sense to use both together. Suppose we set `min_data_in_leaf=200` and `min_sum_hessian_in_leaf=20.0`. In the beginning of the boosting, `min_data_in_leaf` will be the useful constraint, since the booster has low confidence about the predictions and the hessian values are large enough. So in early iterations, a leaf with more than 200 data points will always have a hessian sum greater than 20.0. However, as the boosting goes on, the hessian values become smaller, and `min_sum_hessian_in_leaf` can take effect (see the sketch after this list).
3. `min_sum_hessian_in_leaf` provides a more adaptive way of regularization. For objective functions like binary cross-entropy and lambdarank, the magnitude of the hessian value can reflect the confidence of the booster's prediction for that data point. Using `min_sum_hessian_in_leaf`, we allow the tree to perform more flexible splits for data points with low confidence, and less flexible splits for those with high confidence.
4. As mentioned in 1, sample weights can only influence `min_sum_hessian_in_leaf`, but not `min_data_in_leaf`. Even when all sample weights are 1.0, the hessian values of different data points can differ, so in most cases `min_sum_hessian_in_leaf` and `min_data_in_leaf` behave differently. The only exception is using the l2 loss for regression tasks without sample weights: in that case, all the hessian values will be 1.0, so the two regularization methods are equivalent.
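As an illustration of point 2, here is a minimal sketch of setting both constraints together through the Python API (the synthetic dataset and the specific parameter values are arbitrary, chosen only for demonstration):

```python
import numpy as np
import lightgbm as lgb

# Synthetic binary classification data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    # Binding early on, while hessians are still near their 0.25 maximum.
    "min_data_in_leaf": 200,
    # Starts to bind in later iterations, once confident predictions
    # shrink the per-leaf hessian sums below this threshold.
    "min_sum_hessian_in_leaf": 20.0,
}

booster = lgb.train(params, train_set, num_boost_round=100)
```

Per point 1, if sample weights were passed via the `weight` argument of `lgb.Dataset`, only the hessian-based constraint would be affected by them.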
Thank you very much for the detailed reply, @shiyu1994! This information is very useful :bowing_man:
@shiyu1994 could you add that explanation to https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#for-better-accuracy? I think it's really really good and we should make it easy to discover in the documentation.
Sure. I can add that.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.