Closed brunofacca closed 3 years ago

Hello. Thank you for this great library.

The "Deal with Over-fitting" section of the Parameters Tuning docs says "Use `min_data_in_leaf` and `min_sum_hessian_in_leaf`".

1. Don't those two parameters have the same purpose, except that the former is based only on the number of samples and the latter also takes sample weights into account? That seems to be confirmed by this comment on issue #2258 (which says "you can use `min_sum_hessian_in_leaf` to replace the `min_data_in_leaf`, for sample weights.") and issue #2870.
2. Does it make sense to use both parameters together or to use only one of them?
3. If it doesn't make sense to use both together, do you have any recommendations about how to choose which one to use (e.g., `min_data_in_leaf` if not using sample weights or `min_sum_hessian_in_leaf` if using sample weights)?
4. If all samples have weight 1.0, does setting `min_data_in_leaf=10` have the exact same effect as setting `min_sum_hessian_in_leaf=10`, or do they have different scales? I ask because their default values are very different (20 vs. 1e-3).

Thank you.

Hi. I assumed it would be appropriate to ask the above questions here because it could lead to an improvement in the documentation. If you disagree, please let me know if there is a better place to ask (e.g., a mailing list). Thank you.
Yep, this is the right place! This project is maintained by a small group with limited time, so responses can sometimes be slow.
@shiyu1994 could you help with this?
It's impressive what you've achieved as a small group. Given how powerful and feature-rich LightGBM is, I would have thought there was a large team (probably multiple teams) behind it.
Sorry for nagging you. Just a friendly reminder in case this issue got lost.
@brunofacca Thanks for using LightGBM.
First, we should know what `min_sum_hessian_in_leaf` refers to. It is the sum of the hessians of the data points in a leaf, where the hessian of a data point is the second-order derivative of the loss function w.r.t. the current prediction value.
Here's an example of how the hessian is calculated. Suppose we are training a GBDT model for a binary classification task, and the target is to minimize the binary cross-entropy

$$L = -\sum_i \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]$$

where $p_i = \frac{1}{1 + e^{-\hat{y}_i}}$ and $\hat{y}_i$ is the prediction value (raw score before sigmoid transformation) of LightGBM.

Now we use $\hat{y}_i^m$ for the prediction value of data point $i$ after $m$ iterations of boosting, and $\hat{y}_i^{m+1} = \hat{y}_i^m + f_{m+1}(x_i)$ for all $m \geq 0$, where $f_{m+1}$ is the tree to be trained in iteration $m+1$.

Let $p_i^m = \frac{1}{1 + e^{-\hat{y}_i^m}}$. Then the gradient (actually, the opposite number of the gradient) and the hessian of data point $i$ in iteration $m+1$ are calculated in the following way:

$$g_i^{m+1} = -\frac{\partial l_i}{\partial \hat{y}_i^m} = y_i - p_i^m, \qquad h_i^{m+1} = \frac{\partial^2 l_i}{\partial \left( \hat{y}_i^m \right)^2} = p_i^m \left( 1 - p_i^m \right)$$

where $l_i$ is the loss of data point $i$, i.e. $l_i = -\left[ y_i \log p_i^m + (1 - y_i) \log (1 - p_i^m) \right]$.
Since $h_i^{m+1}$ is always positive, the sum of $h_i^{m+1}$ over all data points in a leaf can sometimes reflect the number of data points in that leaf. But as the boosting goes on, different data points get different $h_i^{m+1}$ values. With binary cross-entropy as the loss, a small $h_i^{m+1}$ means the model has relatively high confidence in its prediction for that data point, since $p_i^m$ is quite close to either $0$ or $1$. Data points whose raw prediction scores $\hat{y}_i^m$ are close to $0$ (which means they are close to the classification boundary of the current model) have relatively high $h_i^{m+1}$ values.
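To make that concrete, here is a small numerical sketch (plain NumPy, not LightGBM internals; the raw scores are arbitrary) of how the hessian $p_i^m (1 - p_i^m)$ behaves:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw scores (the \hat{y}_i^m above), from confidently negative to confidently positive.
raw_scores = np.array([-6.0, -2.0, 0.0, 2.0, 6.0])
p = sigmoid(raw_scores)
hess = p * (1.0 - p)  # h_i^{m+1} = p_i^m * (1 - p_i^m)

for score, h in zip(raw_scores, hess):
    print(f"raw score {score:+.1f} -> hessian {h:.4f}")
# The hessian peaks at 0.25 for a raw score of 0 (near the decision boundary)
# and approaches 0 as the model grows confident, which is why per-leaf
# hessian sums shrink in later boosting iterations.
```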
Based on the above background, here are the answers to your questions.

1. `min_data_in_leaf` does not consider sample weights, but `min_sum_hessian_in_leaf` does, since the hessian values are multiplied by the sample weights before being fed to the tree learner. (Note that sample weights and hessian values are two different concepts!)
2. It can make sense to use both together. Suppose we set `min_data_in_leaf=200` and `min_sum_hessian_in_leaf=20.0`. In the beginning of the boosting, `min_data_in_leaf` will be the useful constraint, since the booster has low confidence about the predictions and the hessian values are large enough. So in early iterations, a leaf with more than 200 data points will always have a hessian sum greater than 20.0. However, as the boosting goes on, the hessian values become smaller, and `min_sum_hessian_in_leaf` can take effect (see the sketch after this list).
3. `min_sum_hessian_in_leaf` provides a more adaptive way of regularization. For objective functions like binary cross-entropy and lambdarank, the magnitude of the hessian value can reflect the confidence of the booster's prediction for that data point. Using `min_sum_hessian_in_leaf`, we allow the tree to perform more flexible splits for data points with low confidence, and less flexible splits for those with high confidence.
4. As mentioned in 1, sample weights can only influence `min_sum_hessian_in_leaf`, but not `min_data_in_leaf`. Even when all sample weights are 1.0, the hessian values of different data points can differ, so in most cases `min_sum_hessian_in_leaf` and `min_data_in_leaf` behave differently. The only exception is using the l2 loss for regression tasks without sample weights: in that case, all the hessian values will be 1.0, so the two regularization methods are equivalent.
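As an illustration of point 2, here is a minimal sketch of setting both constraints together through the Python API (the synthetic dataset and the specific parameter values are arbitrary, chosen only for demonstration):

```python
import numpy as np
import lightgbm as lgb

# Synthetic binary classification data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 20))
y = (X[:, 0] + rng.normal(size=10_000) > 0).astype(int)

train_set = lgb.Dataset(X, label=y)

params = {
    "objective": "binary",
    # Binding early on, while hessians are still near their 0.25 maximum.
    "min_data_in_leaf": 200,
    # Starts to bind in later iterations, once confident predictions
    # shrink the per-leaf hessian sums below this threshold.
    "min_sum_hessian_in_leaf": 20.0,
}

booster = lgb.train(params, train_set, num_boost_round=100)
```

Per point 1, if sample weights were passed via the `weight` argument of `lgb.Dataset`, only the hessian-based constraint would be affected by them.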
Thank you very much for the detailed reply, @shiyu1994! This information is very useful :bowing_man:
@shiyu1994 could you add that explanation to https://lightgbm.readthedocs.io/en/latest/Parameters-Tuning.html#for-better-accuracy? I think it's really really good and we should make it easy to discover in the documentation.
Sure. I can add that.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.