microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Why lambdaRank best gain is -inf? #387

Closed liyi193328 closed 7 years ago

liyi193328 commented 7 years ago

Hi, when I use lambdaRank to predict rankings, I carefully formatted my data according to the python package's test file and its data format. I can run the test .py correctly, but when I run my own data with

    lgb_model = lgb.LGBMRanker().fit(X_train, y_train,
                                     group=q_train,
                                     eval_set=[(X_test, y_test)],
                                     eval_group=[q_test],
                                     eval_at=[10],
                                     verbose=True,
                                     # callbacks=[lgb.print_evaluation()]
                                     # callbacks=[lgb.reset_parameter(learning_rate=lambda x: 0.95 ** x * 0.1)]
                                     )

the result is:

    [LightGBM] [Info] Finished loading parameters
    [LightGBM] [Info] Loading query boundaries...
    [LightGBM] [Info] Loading query boundaries...
    [LightGBM] [Info] Finished loading data in 0.149870 seconds
    [LightGBM] [Info] Total Bins 4578
    [LightGBM] [Info] Number of data: 35076, number of used features: 18
    [LightGBM] [Info] Finished initializing training
    [LightGBM] [Info] Started training...
    [LightGBM] [Info] No further splits with positive gain, best gain: -inf
    [LightGBM] [Info] Trained a tree with leaves=1 and max_depth=1
    [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements.
    [LightGBM] [Info] 0.013264 seconds elapsed, finished iteration 1
    [LightGBM] [Info] Finished training

The model file is like this:

    tree
    num_class=1
    label_index=0
    max_feature_idx=18
    objective=lambdarank sigmoid=-1
    boost_from_average
    feature_names=Column_0 Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18
    feature_infos=none [0:1.000000000000014] [0:1.0000000000000071] [0:1.0000000000000071] [0:1.0000000000000071] [0:1] [0:1.000000000000002] [0:1] [0:1.0000000000000011] [0:1.000000000000002] [0:1.0000000000000011] [0:1.0000000000000011] [0:1.0000000000000011] [0:1.0000000000000011] [0:1.0000000000000071] [0:1.0000000000000071] [0:1.000000000000004] [0:1.000000000000004] [-0.8835864063629788:0.43636767811492733]
    Tree=0
    num_leaves=2
    split_feature=0
    split_gain=1
    threshold=0
    decision_type=0
    left_child=-1
    right_child=-2
    leaf_parent=0 0
    leaf_value=0.0022795419503482211 0.0022795419503482211
    leaf_count=0 35076
    internal_value=0
    internal_count=35076
    shrinkage=1

    feature importances:
    Column_0=1

When I train on the same data and group ids with rankSVM or xgboost's pairwise learning, they produce reasonable predictions.

So what's the reason?

guolinke commented 7 years ago

@liyi193328

  1. It seems you are not using the latest code.
  2. What are your parameters?
  3. What is the range of your labels?
liyi193328 commented 7 years ago

@guolinke Thanks. I installed LightGBM 3 days ago. The labels are floats in the range [-2, 2]; my train.conf is the same as: https://github.com/Microsoft/LightGBM/blob/master/examples/lambdarank/train.conf

guolinke commented 7 years ago

@liyi193328 I see. For the ranking task, the label in LightGBM should be an integer >= 0, where a larger label means more relevant (better).
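A minimal sketch of how float labels in [-2, 2] could be converted to the non-negative integer relevance grades LightGBM expects; the function name and the choice of 5 grades are illustrative, not anything the library prescribes:

```python
def to_relevance_grade(label, lo=-2.0, hi=2.0, num_grades=5):
    """Map a float label in [lo, hi] to an integer grade in [0, num_grades-1].

    The mapping is monotone: a larger float label never gets a smaller grade,
    which preserves the "larger means more relevant" convention.
    """
    x = min(max(label, lo), hi)                        # clip to the known range
    return int((x - lo) / (hi - lo) * (num_grades - 1) + 0.5)  # round to nearest grade

# e.g. -2.0 -> 0, 0.0 -> 2, 2.0 -> 4
grades = [to_relevance_grade(v) for v in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```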

liyi193328 commented 7 years ago

@guolinke Thanks, I'll test it.

liyi193328 commented 7 years ago

@guolinke Problem solved, thanks. One more question: I can't find num_boost_round (i.e. the number of iterations/epochs) in the sklearn API; neither the constructor nor the fit function has it. The train API has num_boost_round, and the command-line conf file has num_trees. So what am I missing? There must be some way to set the number of iterations with the sklearn API.

liyi193328 commented 7 years ago

@guolinke Just now, my small data set ran OK with labels in [0, 30], but when I increase the label range to [0, 2800], I get an error like this:

    [LightGBM] [Info] Finished loading parameters
    [LightGBM] [Info] Loading query boundaries...
    [LightGBM] [Fatal] Label excel 0
    Met Exceptions:
    Label excel 0

What's the underlying reason? Thanks.

guolinke commented 7 years ago

@liyi193328 Do you really have that many relevance levels for ranking? Refer to https://github.com/Microsoft/LightGBM/blob/master/src/io/config.cpp#L267-L271 . LightGBM uses (2^label) as the label gain, and the default max label is 31, so a label of 2800 cannot be handled.
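One hedged way to fit a wide label range such as [0, 2800] under that cap is a rank-based rescaling onto [0, 30]; the helper below is an illustrative sketch, not part of LightGBM:

```python
def rescale_labels(labels, max_label=30):
    """Rank-based rescaling: map arbitrary non-negative labels onto
    integer grades in [0, max_label] while preserving their order.

    Distinct label values are spread evenly across the allowed grade range,
    so relative relevance ordering within a query is kept intact.
    """
    order = sorted(set(labels))
    grade = {v: round(i * max_label / max(len(order) - 1, 1))
             for i, v in enumerate(order)}
    return [grade[v] for v in labels]

# e.g. [0, 1, 2800] -> [0, 15, 30]
rescaled = rescale_labels([0, 1, 2800])
```

Whether evenly spreading distinct values (rather than, say, percentile binning of all samples) is appropriate depends on how the original labels are distributed.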

liyi193328 commented 7 years ago

@guolinke I can scale the rank labels, then. What about num_boost_round in the Python sklearn API? Thanks.

guolinke commented 7 years ago

Refer to: https://github.com/Microsoft/LightGBM/blob/master/examples/python-guide/sklearn_example.py#L20

Use n_estimators, the same as in sklearn.

github-actions[bot] commented 12 months ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.