microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

LightGBMRanker stopped training (No further splits with positive gain) #867

Open ElysiumFan086 opened 4 years ago

ElysiumFan086 commented 4 years ago

@imatiach-msft I have run into trouble training a LambdaRank model in Spark with LightGBMRanker in mmlspark.

With the same training data, the training results in Spark and on my local machine differ:

  1. In Spark with mmlspark, the training log is shown below:

    [LightGBM] [Info] Total Bins 15570
    [LightGBM] [Info] Number of data: 6931159, number of used features: 101
    [LightGBM] [Warning] No further splits with positive gain, best gain: -inf
    [LightGBM] [Warning] Stopped training because there are no more leaves that meet the split requirements
  2. On my local machine, where LightGBM is installed via conda, training finishes successfully.

I am not sure what causes this difference. I have tried modifying some parameters such as maxPosition and the partition count in Spark, but that proved to be of no use. Setting min_data_in_leaf to a smaller value was also suggested, but I could not find any way to set this parameter through the mmlspark LightGBMRanker interface.

If anyone has experience solving a similar problem, I would greatly appreciate your advice.

The following information may be helpful for diagnosing what happened:

  1. parameters used to initialize the ranker:
        # Import path assumed for this mmlspark build; adjust to your installed version.
        from mmlspark.lightgbm import LightGBMRanker

        model = LightGBMRanker(
            parallelism='data_parallel',
            objective='lambdarank',
            boostingType='gbdt',
            numIterations=500,
            learningRate=0.1,
            numLeaves=1023,
            maxDepth=10,
            earlyStoppingRound=0,
            maxPosition=8,
            minSumHessianInLeaf=0.001,
            lambdaL1=0.01,
            lambdaL2=0.01,
            isProvideTrainingMetric=True,
            defaultListenPort=49650,
            featuresCol='features',
            groupCol='query_id',
            labelCol='label',
            numBatches=0,
            timeout=600000.0,
            verbosity=1).fit(df)
  2. Model saved on HDFS:
    
    tree
    version=v2
    num_class=1
    num_tree_per_iteration=1
    label_index=0
    max_feature_idx=100
    objective=lambdarank
    feature_names=Column_0 Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27 Column_28 Column_29 Column_30 Column_31 Column_32 Column_33 Column_34 Column_35 Column_36 Column_37 Column_38 Column_39 Column_40 Column_41 Column_42 Column_43 Column_44 Column_45 Column_46 Column_47 Column_48 Column_49 Column_50 Column_51 Column_52 Column_53 Column_54 Column_55 Column_56 Column_57 Column_58 Column_59 Column_60 Column_61 Column_62 Column_63 Column_64 Column_65 Column_66 Column_67 Column_68 Column_69 Column_70 Column_71 Column_72 Column_73 Column_74 Column_75 Column_76 Column_77 Column_78 Column_79 Column_80 Column_81 Column_82 Column_83 Column_84 Column_85 Column_86 Column_87 Column_88 Column_89 Column_90 Column_91 Column_92 Column_93 Column_94 Column_95 Column_96 Column_97 Column_98 Column_99 Column_100
    feature_infos=[0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:0.6875] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:1] [0.0625:0.66666666666666663] [0.0625:0.75] [0.0625:1] [0.0625:1] [0.0625:0.5] [0.0625:1] [0.0625:0.5] [0.0625:0.5] [0.0625:0.5] [0.080414072462680952:1] [0.064750067944195744:1] [0.19943749011195866:1] [0:10577.456899206878] [0:7344.7041693973688] [0:2183.4647777018554] [0:4732.2191993076412] [0:1] [0:14] [0:3351.044805746792] [0:1] [0.11764705882352941:1] [0:2.5316795121586644] [0:3.7380800344091236] [0:2.9357296914870372] [0:3.145218091584959] [0:1] [0:2.0965004622571506] [0:1] [0:84.075849917948489] [0:10] [0:139.76646984770156] [1:1] [0:12276] [1:1] [1:1] [1:1] [0:1] [0:1] [0:2039.8949760996034] [0:1] [0:1] [0:1] [0:1] [0:1] [0:8] [0:1] [0:23] [0:11] [0:1] [0:1] [0:1] [0:1] [0:1] [0:1] [0:70.84810176457917] [0:215.38992789209462] [0:16] [0:149.02534452055693] [-1:-1] [0:54037] [-1:-1] [-1:-1] [-1:-1] [-22.870030939860886:1002.4930215402027] [0:1491.566509255349] [0:1491.566509255349] [0:1491.566509255349] [0:1491.566509255349] [0:1] [0:1] [0:1] [0:1] [0.0237517702061:0.99370693563099999] [0:0.83040188927199998] [0:1] [0:1] [0:1] [0:1] [0:1] [0:1] [0:1] [0:1] [0:1]
    tree_sizes=180

Tree=0
num_leaves=1
num_cat=0
split_feature=
split_gain=
threshold=
decision_type=
left_child=
right_child=
leaf_value=0
leaf_count=0
internal_value=
internal_count=
shrinkage=1

end of trees

feature importances:

parameters: [boosting: gbdt] [objective: lambdarank] [metric: lambdarank] [tree_learner: serial] [device_type: cpu] [data: ] [valid: ] [num_iterations: 500] [learning_rate: 0.1] [num_leaves: 1023] [num_threads: 0] [max_depth: 10] [min_data_in_leaf: 20] [min_sum_hessian_in_leaf: 0.001] [bagging_fraction: 1] [bagging_freq: 0] [bagging_seed: 3] [feature_fraction: 1] [feature_fraction_seed: 2] [early_stopping_round: 0] [max_delta_step: 0] [lambda_l1: 0.01] [lambda_l2: 0.01] [min_gain_to_split: 0] [drop_rate: 0.1] [max_drop: 50] [skip_drop: 0.5] [xgboost_dart_mode: 0] [uniform_drop: 0] [drop_seed: 4] [top_rate: 0.2] [other_rate: 0.1] [min_data_per_group: 100] [max_cat_threshold: 32] [cat_l2: 10] [cat_smooth: 10] [max_cat_to_onehot: 4] [top_k: 20] [monotone_constraints: ] [feature_contri: ] [forcedsplits_filename: ] [refit_decay_rate: 0.9] [cegb_tradeoff: 1] [cegb_penalty_split: 0] [cegb_penalty_feature_lazy: ] [cegb_penalty_feature_coupled: ] [verbosity: 1] [max_bin: 255] [min_data_in_bin: 3] [bin_construct_sample_cnt: 200000] [histogram_pool_size: -1] [data_random_seed: 1] [output_model: LightGBM_model.txt] [snapshot_freq: -1] [input_model: ] [output_result: LightGBM_predict_result.txt] [initscore_filename: ] [valid_data_initscores: ] [pre_partition: 1] [enable_bundle: 1] [max_conflict_rate: 0] [is_enable_sparse: 1] [sparse_threshold: 0.8] [use_missing: 1] [zero_as_missing: 0] [two_round: 0] [save_binary: 0] [header: 0] [label_column: ] [weight_column: ] [group_column: ] [ignore_column: ] [categorical_feature: ] [predict_raw_score: 0] [predict_leaf_index: 0] [predict_contrib: 0] [num_iteration_predict: -1] [pred_early_stop: 0] [pred_early_stop_freq: 10] [pred_early_stop_margin: 10] [convert_model_language: ] [convert_model: gbdt_prediction.cpp] [num_class: 1] [is_unbalance: 0] [scale_pos_weight: 1] [sigmoid: 1] [boost_from_average: 1] [reg_sqrt: 0] [alpha: 0.9] [fair_c: 1] [poisson_max_delta_step: 0.7] [tweedie_variance_power: 1.5] [max_position: 3] [label_gain: ] [metric_freq: 1] [is_provide_training_metric: 0] [eval_at: ] [num_machines: 1] [local_listen_port: 12400] [time_out: 120] [machine_list_filename: ] [machines: ] [gpu_platform_id: -1] [gpu_device_id: -1] [gpu_use_dp: 0]

end of parameters


  3. Library version info:
     mmlspark Scala version: 2.11
     LightGBM on Spark: 2.2.350
     LightGBM on local machine: 2.3.0
welcome[bot] commented 4 years ago

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

imatiach-msft commented 4 years ago

hi @ElysiumFan086, I wonder if this is similar to this issue (based on a quick search): https://github.com/microsoft/LightGBM/issues/2953. You could also try posting this to the LightGBM forum. Is the dataset private? Can you create a small repro with either a toy dataset or the one you are using (if it is not private)? That may be the best way to debug this.
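For reference, a toy repro skeleton along these lines might look as follows. This is only a minimal sketch, not the reporter's actual pipeline: the import path, column names, and synthetic data are assumptions.

    # Minimal sketch of a toy LightGBMRanker repro (hypothetical data; import path assumed).
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from mmlspark.lightgbm import LightGBMRanker  # adjust to your installed mmlspark version

    spark = SparkSession.builder.getOrCreate()

    # Tiny synthetic ranking dataset: (query_id, label, two raw features).
    rows = [
        (0, 3.0, 0.9, 0.1), (0, 1.0, 0.2, 0.8), (0, 0.0, 0.1, 0.7),
        (1, 2.0, 0.6, 0.3), (1, 2.0, 0.5, 0.4), (1, 0.0, 0.1, 0.9),
    ]
    df = spark.createDataFrame(rows, ["query_id", "label", "f0", "f1"])
    df = VectorAssembler(inputCols=["f0", "f1"], outputCol="features").transform(df)
    df = df.coalesce(1)  # keep each query's rows within a single partition

    ranker = LightGBMRanker(
        objective="lambdarank",
        featuresCol="features",
        labelCol="label",
        groupCol="query_id",
        numIterations=10,
        numLeaves=7,
    )
    model = ranker.fit(df)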

imatiach-msft commented 4 years ago

See some other similar issues: https://github.com/microsoft/LightGBM/issues/2597, and another possibly related post: https://github.com/microsoft/LightGBM/issues/2239.

Have you tried the latest code from master? I just updated to the latest LightGBM on master recently. This issue may have already been fixed.

ElysiumFan086 commented 4 years ago

@imatiach-msft Thank you for your advice. Due to our privacy policy, it is not convenient to share the dataset here. But I have checked the dataset, and it seems that in a large number of queries many items share the same ranking label, which may result from my labeling strategy; I will check it carefully. For example, the ranking labels within a query may look like this (see the sketch after the example below):

Q1: 4, 4, 4, 4, 4, 0, 0 
Q2: 2, 2, 2, 2, 2, 2, 4
Q3: 1, 1, 1, 1, 4, 2
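In the extreme case where every document in a query shares one label, lambdarank has no document pairs with different relevance to learn from, so such queries contribute nothing to the split gain. A local toy sketch of that case (hypothetical data, using the standalone lightgbm Python package) is shown below:

    # Toy sketch (hypothetical data): every query gets a single constant label,
    # so lambdarank has no pairs with different relevance to learn from.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    n_queries, docs_per_query, n_features = 50, 8, 5
    X = rng.random((n_queries * docs_per_query, n_features))
    y = np.repeat(rng.integers(0, 5, size=n_queries), docs_per_query)  # constant label within each query
    group = [docs_per_query] * n_queries

    ranker = lgb.LGBMRanker(objective="lambdarank", n_estimators=10, min_child_samples=1)
    ranker.fit(X, y, group=group)  # likely surfaces warnings similar to "No further splits with positive gain"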

But I still have several questions for you:

  1. For mmlspark, is there any way to set the min_data_in_leaf parameter for LightGBMRanker?
  2. Although the labels of the training data are a little unreasonable, that still does not explain why training works on a single machine.
  3. I am going to try the latest version, which requires uploading the jar packages of mmlspark and the related LightGBM to our private platform. Could you give me a URL to download the needed jars?

I hope this does not bother you too much, and thank you again!

imatiach-msft commented 4 years ago

@ElysiumFan086

  1. That parameter should already be available on latest master: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMParams.scala#L357
  2. I'm guessing that either the native code is different, or the distributed data partitioning causes the logic to differ slightly and trigger the issue; either way, the fix will probably be in the LightGBM codebase.
  3. I think this is the latest version:

Maven Coordinates

com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-86-05c25aad-SNAPSHOT

Maven Resolver

https://mmlspark.azureedge.net/maven

This comes from this PR build: https://github.com/Azure/mmlspark/pull/866/checks?check_run_id=715637908
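As a rough illustration of points 1 and 3, a sketch is below. Treat it as an assumption-laden example: the minDataInLeaf parameter name is inferred from the linked LightGBMParams.scala, the import path is assumed for this build, and setting spark.jars.packages / spark.jars.repositories is just one way to attach the snapshot package.

    # Sketch: attach the snapshot build via Spark's Maven settings, then set minDataInLeaf.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars.packages",
                "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1-86-05c25aad-SNAPSHOT")
        .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
        .getOrCreate()
    )

    from mmlspark.lightgbm import LightGBMRanker  # import path assumed for this build

    ranker = LightGBMRanker(
        objective="lambdarank",
        featuresCol="features",
        labelCol="label",
        groupCol="query_id",
        minDataInLeaf=50,  # parameter name per the linked LightGBMParams.scala
    )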

ElysiumFan086 commented 4 years ago

Thanks~~ @imatiach-msft I have tried the latest version, but I still face the same problem. Currently we have no idea about this issue, and all we can do is train on a local machine, which just takes more time.

wil70 commented 1 year ago

Hello, any update on this? TY

ElysiumFan086 commented 1 year ago

This is RongFan. I have received your mail just now, and I will read it soon. Thank you!