microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

performance of lambdarank using custom objective is poor when compared with in-built lambdarank #2239

Closed: ii-research-yu closed this issue 4 years ago

ii-research-yu commented 5 years ago

Based on a commonly used dataset, say MQ2008, the 5-fold CV performance of the in-built lambdarank is:

nDCG@1:0.4319, nDCG@3:0.4574, nDCG@5:0.4953, nDCG@10:0.5844

By setting the hessian to a constant value, say 1.0, the 5-fold CV performance of the manually plugged-in lambdarank (via the fobj parameter) is:

nDCG@1:0.3483, nDCG@3:0.4021, nDCG@5:0.4429, nDCG@10:0.5394

If the manually computed hessian is used instead, the following issue is observed, namely the training makes no progress (the validation metrics stay flat) ...

[10] valid_0's ndcg@1: 0.113095  valid_0's ndcg@2: 0.161686  valid_0's ndcg@3: 0.202154  valid_0's ndcg@4: 0.214604  valid_0's ndcg@5: 0.23572
[20] valid_0's ndcg@1: 0.113095  valid_0's ndcg@2: 0.161686  valid_0's ndcg@3: 0.202154  valid_0's ndcg@4: 0.214604  valid_0's ndcg@5: 0.23572
[30] valid_0's ndcg@1: 0.113095  valid_0's ndcg@2: 0.161686  valid_0's ndcg@3: 0.202154  valid_0's ndcg@4: 0.214604  valid_0's ndcg@5: 0.23572

Environment info

Operating System: ubuntu 16.04

Python version: 3.7

LightGBM version or commit hash: 2.2.3

Any comments on the above results and the following questions are highly appreciated.
(1) Why is there such an obvious difference between the in-built lambdarank and the manually plugged-in lambdarank? Is it due to the internal parameter settings?
(2) What are the tips for using a manually computed hessian, which seems to be quite sensitive and fragile?

guolinke commented 5 years ago

I think it is related to your custom objective, could you provide it? And what is the "manually computed hessian", are there any differences from the built-in hessian?

ii-research-yu commented 5 years ago

@guolinke so many thanks for your quick response. I just created a gist snippet https://gist.github.com/y-research-yu/ae9d32dabce52d33073278c2a13dabac

I made the comparison by commenting fobj out or not to test the custom objective, e.g.:

    params = {'boosting_type': 'gbdt',  # gbdt, dart
              'objective': 'lambdarank',
              'metric': 'ndcg',
              'learning_rate': 0.1,
              'num_leaves': 255,
              'num_trees': 500,
              'num_threads': 16,
              'min_data_in_leaf': 0,
              'min_sum_hessian_in_leaf': 1.0,
              'verbosity': -1}

    # built-in lambdarank:
    gbm = lgb.train(params=params, train_set=train_set, valid_sets=[valid_set], verbose_eval=10,
                    # fobj=lightgbm_custom_obj_lambdarank
                    )

or

    # custom objective plugged in via fobj:
    gbm = lgb.train(params=params, train_set=train_set, valid_sets=[valid_set], verbose_eval=10,
                    fobj=lightgbm_custom_obj_lambdarank
                    )
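For reference, the hook `lgb.train(..., fobj=...)` in the 2.x Python API expects a callable `fobj(preds, train_data)` that returns per-row gradient and hessian arrays. The sketch below is mine rather than the gist's code: `per_query_grad_hess` is a hypothetical helper, the constant hessian mirrors the 1.0 variant described in the first post, and depending on the LightGBM version `Dataset.get_group()` may return group sizes or query boundaries.

    import numpy as np

    def lightgbm_custom_obj_lambdarank(preds, train_data):
        """Custom objective for lgb.train(..., fobj=...): given the raw scores and
        the Dataset, return (grad, hess) with one value per training row."""
        labels = train_data.get_label()
        group_sizes = np.asarray(train_data.get_group(), dtype=int)  # docs per query (see note above)
        grad = np.zeros_like(preds, dtype=float)
        hess = np.ones_like(preds, dtype=float)   # constant-hessian variant from the first post
        start = 0
        for size in group_sizes:
            sl = slice(start, start + size)
            # per-query lambdarank gradients (and hessians) would be filled in here, e.g.
            # grad[sl], hess[sl] = per_query_grad_hess(preds[sl], labels[sl])  # hypothetical helper
            start += size
        return grad, hess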

Ian09 commented 5 years ago

@y-research-yu hi, I have the same question as you. I have checked your gist and I think the gradient and hessian are correct. But the performance with your implementation of the lambdarank objective is very poor; the output is:

[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] training's ndcg@1: 0.402163  training's ndcg@2: 0.422151  training's ndcg@3: 0.442433  training's ndcg@4: 0.462206  training's ndcg@5: 0.483969
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] training's ndcg@1: 0.393314  training's ndcg@2: 0.428606  training's ndcg@3: 0.448456  training's ndcg@4: 0.477228  training's ndcg@5: 0.501381
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] training's ndcg@1: 0.418879  training's ndcg@2: 0.454627  training's ndcg@3: 0.473396  training's ndcg@4: 0.50212  training's ndcg@5: 0.523036
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] training's ndcg@1: 0.420846  training's ndcg@2: 0.458612  training's ndcg@3: 0.476477  training's ndcg@4: 0.501127  training's ndcg@5: 0.521555
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[6] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[7] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[8] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[9] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
[10] training's ndcg@1: 0.432645  training's ndcg@2: 0.459553  training's ndcg@3: 0.480824  training's ndcg@4: 0.503187  training's ndcg@5: 0.522683

Did you make any progress on this issue? Thanks very much.

ii-research-yu commented 5 years ago

@Ian09 Hi, thanks for the message. Unfortunately, I have made no progress on this issue. @guolinke Any inspiring comments are highly appreciated, thanks!

guolinke commented 5 years ago

@y-research-yu I think there are some bugs in your code. You can check the grad/hess for numerical issues, such as NaN/inf/zeros.

For your questions: the built-in lambdarank only has a slight difference from the original algorithm; you can check the ndcg_norm parameter.

BTW, you can set 'verbosity': 2 for more information.
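As a concrete version of that check, a small helper like the one below (my own code, not part of LightGBM) can be called on the arrays returned by the custom objective at each iteration:

    import numpy as np

    def check_grad_hess(grad, hess):
        """Report NaN / inf / zero / negative counts and basic statistics
        before handing grad/hess back to LightGBM."""
        grad = np.asarray(grad, dtype=float)
        hess = np.asarray(hess, dtype=float)
        print("grad: nan=%d inf=%d zero=%d" %
              (np.isnan(grad).sum(), np.isinf(grad).sum(), (grad == 0).sum()))
        print("hess: nan=%d inf=%d zero=%d negative=%d" %
              (np.isnan(hess).sum(), np.isinf(hess).sum(),
               (hess == 0).sum(), (hess < 0).sum()))
        print("hess: min=%g max=%g mean=%g sum=%g" %
              (hess.min(), hess.max(), hess.mean(), hess.sum()))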

Ian09 commented 5 years ago

Hi @guolinke and @y-research-yu, I have checked the hessian and found that it is very small. For example, my data has 7903 items, and for the first iteration (tree) the hessian has mean -2.2476994045302423e-19, sum -1.7763568394002505e-15, min -1.0848717237770573, max 6.157482893792464, and one item is exactly zero. Is this correct/normal for the hessian?

BTW, I have checked @y-research-yu's implementation, and I think the hessian computation is correct, which is epsilon^2 * sigmoid(s_{ij}) * (1 - sigmoid(s_{ij})).

guolinke commented 5 years ago

@Ian09 thanks! The small hessian may cause the problem. The output of a tree leaf is sum_grad / sum_hess.

@y-research-yu BTW, the hessian should be a positive number, since epsilon^2 * sigmoid(s_{ij}) * (1 - sigmoid(s_{ij})) is the product of three positive numbers. So I think there must be some bugs in your code.

guolinke commented 5 years ago

@y-research-yu I think the following code may be wrong: https://gist.github.com/y-research-yu/ae9d32dabce52d33073278c2a13dabac#file-lightgbm_custom_obj_lambdarank-py-L83-L91

The hessian should always be accumulated; there is no need for the lambda_ji_2order term. Refer to https://github.com/microsoft/LightGBM/blob/df26b65d786532b88b13d13cf8e4b7ac9fa8d8cd/src/objective/rank_objective.hpp#L156-L165
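To make the accumulation pattern concrete, here is a rough per-query Python sketch in the spirit of the linked C++ code. It is a simplification (plain sigmoid, no rank truncation, my own |delta NDCG| weighting), not LightGBM's exact implementation:

    import numpy as np

    def lambdarank_grad_hess_one_query(scores, labels, sigma=2.0):
        """For every pair with labels[i] > labels[j], the gradient contributions
        have opposite signs, but the hessian term is ADDED for both documents,
        so each document's hessian stays non-negative."""
        scores = np.asarray(scores, dtype=float)
        labels = np.asarray(labels, dtype=float)
        n = len(scores)
        grad = np.zeros(n)
        hess = np.zeros(n)
        gains = 2.0 ** labels - 1.0
        order = np.argsort(-scores)                 # current ranking by score
        rank = np.empty(n, dtype=int)
        rank[order] = np.arange(n)
        discount = 1.0 / np.log2(rank + 2.0)
        idcg = np.sum(np.sort(gains)[::-1] / np.log2(np.arange(2, n + 2)))
        inv_idcg = 1.0 / max(idcg, 1e-12)
        for i in range(n):
            for j in range(n):
                if labels[i] <= labels[j]:
                    continue                        # only pairs where doc i is more relevant
                rho = 1.0 / (1.0 + np.exp(sigma * (scores[i] - scores[j])))
                delta_ndcg = abs((gains[i] - gains[j]) * (discount[i] - discount[j])) * inv_idcg
                lam = sigma * rho * delta_ndcg                      # |lambda_ij|
                h = sigma * sigma * rho * (1.0 - rho) * delta_ndcg  # always >= 0
                grad[i] -= lam    # more relevant doc: loss gradient pushes its score up
                grad[j] += lam    # less relevant doc: pushed down
                hess[i] += h      # hessian accumulated for BOTH documents,
                hess[j] += h      # never subtracted
        return grad, hess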

akurennoy commented 5 years ago

@guolinke Hi, the fact that the hessian is always accumulated is a difference from the formulation in the original paper (https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf), right? As far as I understand, the last formula on page 16 (Section 7, LambdaMART) means that the hessian contribution is added when the item's label is higher and subtracted when it is lower. Or am I missing something?

If I'm right, is it just a heuristic, or is there some reasoning behind the difference (i.e. always accumulating the hessian)?

Thanks!

guolinke commented 5 years ago

Thanks @akurennoy. Since the learning mainly depends on the gradients, I think the hessian (which acts like a weight) should always be positive. For an intuitive example: for a low-label doc with the highest score, its gradient will be negative; if its hessian is also negative (neg / neg = pos), the doc will not be pushed backward. As for the equation in the paper, I will check it carefully.
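A toy numeric version of that intuition, using the leaf-output convention from the earlier comment (leaf value = sum_grad / sum_hess; the numbers are made up):

    # A leaf dominated by an over-ranked low-relevance document.
    sum_grad = -0.8       # accumulated gradient: "push this score down" under the convention above
    sum_hess_ok = 0.5     # positive hessian
    sum_hess_bad = -0.5   # hessian that incorrectly came out negative

    print(sum_grad / sum_hess_ok)    # -1.6 -> the score is decreased, as intended
    print(sum_grad / sum_hess_bad)   #  1.6 -> neg / neg = pos, the doc is not pushed backward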

akurennoy commented 5 years ago

Thank you very much for a quick reply. Makes sense. Apparently, designing the algorithm in full analogy with the Newton method wouldn't be a good idea here. (It's worth noting that the Newton method can "spoil" the gradient direction in classical optimization as well when the current point is far from a solution.)

guolinke commented 4 years ago

In the Newton step for GBDT, the loss is expanded with a second-order Taylor expansion. The loss then becomes a quadratic function, which reaches its minimum at x = -b/(2a) only when a > 0; here a = sum_hessian / 2 and b = sum_gradient.
Therefore, sum_hessian should always be positive (or at least constant); otherwise the theory behind the Newton step is broken.

also refer to https://xgboost.readthedocs.io/en/latest/tutorials/model.html#the-structure-score
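A quick numeric check of this argument with toy numbers, approximating the loss change for a leaf value w as G*w + 0.5*H*w^2 as in the expansion above:

    import numpy as np

    def taylor_loss(w, G, H):
        # second-order approximation of the loss change for leaf value w
        return G * w + 0.5 * H * w ** 2

    G = 1.0
    grid = np.linspace(-2.0, 2.0, 401)
    for H in (2.0, -2.0):
        w_star = -G / H    # the Newton step x = -b/(2a) with a = H/2, b = G
        print("H=%+.1f  w*=%+.2f  loss(w*)=%+.3f  min over grid=%+.3f" %
              (H, w_star, taylor_loss(w_star, G, H), taylor_loss(grid, G, H).min()))
    # With H > 0, loss(w*) matches the minimum of the quadratic, so the step is valid.
    # With H < 0, w* is actually the maximum of the quadratic, so the "Newton step"
    # moves in the wrong direction; this is why sum_hessian must stay positive.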