microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.07k stars 831 forks

LightGBMRanker on spark group/query parameter explain #682

Closed yinshurman closed 5 years ago

yinshurman commented 5 years ago

Is there any documentation on how to use LightGBMRanker on Spark? Specifically, how do I set the query data? The non-Spark version of LightGBMRanker uses a very special rule: [12, 10, ...] means the first 12 items belong to the first group, the next 10 belong to the second, and so on. How does the Spark version set this value? I can see there is a groupCol parameter, so what is this column's type? (Can it be a string id that represents a group, or must it be bigint?) And what is the rule for this column? Is it the same as the non-Spark version (in the example above, the first group's values would all be 12, the next group's values all 10, etc.), or can it just be the group id? Lastly, how many classes can labelCol's value take? Is there a maximum number of classes, as in regular LightGBM? Hope to get a response as soon as possible, because this is a matter of great urgency! Thanks very much.

imatiach-msft commented 5 years ago

@yinshurman sorry about the trouble you are having. I definitely agree that we need more/better documentation for some features. The query column in mmlspark is a per-row column, and the ids have to be unique across different groups: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/core/contracts/Params.scala#L164 The rows can be in any order. Internally, I sort each partition by that column and then build the "very special rule" representation you mentioned above: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/TrainUtils.scala#L98
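To make that conversion concrete, here is a minimal pure-Python sketch (no Spark; the function name is mine, not an mmlspark API): per-row group ids are sorted, mirroring the within-partition sort, and then collapsed into the [12, 10, ...] cardinality list the native LightGBM API expects:

```python
from itertools import groupby

def group_ids_to_cardinalities(group_ids):
    """Collapse per-row group ids into the [12, 10, ...] cardinality
    form used by the native (non-Spark) LightGBM ranking API."""
    # Sort first, mirroring the within-partition sort mmlspark performs.
    sorted_ids = sorted(group_ids)
    # Each run of equal ids becomes one cardinality entry.
    return [len(list(run)) for _, run in groupby(sorted_ids)]

# Per-row ids in arbitrary order: group "a" has 3 rows, group "b" has 2.
print(group_ids_to_cardinalities(["b", "a", "a", "b", "a"]))  # → [3, 2]
```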

Note the comments in that function, e.g.:

      // Convert to distinct count (note ranker should have sorted within partition by group id)
      // We use a triplet of a list of cardinalities, last unique value and unique value count

So in some sense it is simpler on the mmlspark side. The only unfortunate thing is that there is a sort, but only within partitions: https://github.com/Azure/mmlspark/blob/master/src/main/scala/com/microsoft/ml/spark/lightgbm/LightGBMRanker.scala#L75 Note that for best accuracy and correctness, we need to ensure that all entries within a group are inside one partition; if some group is split among multiple partitions, LightGBMRanker will still run, but the ranking metric (e.g. NDCG) will probably suffer. LightGBMRanker is very new, so if you have any ideas on how this could be improved, please let me know!
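To illustrate why keeping a group inside one partition is achievable, here is a small pure-Python sketch (illustrative only, not the actual Spark code) of hash-partitioning rows by group id, which is the effect of repartitioning a DataFrame by the group column: every row of a given group lands in the same partition.

```python
def hash_partition(rows, key, num_partitions):
    """Assign each row to a partition by hashing its group key.
    Hash partitioning guarantees that all rows sharing a key land in
    the same partition (the effect of repartitioning by the group column)."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[hash(key(row)) % num_partitions].append(row)
    return parts

# Five rows from three query groups, in arbitrary order.
rows = [("q1", 0.1), ("q2", 0.3), ("q1", 0.2), ("q3", 0.9), ("q2", 0.4)]
parts = hash_partition(rows, key=lambda r: r[0], num_partitions=2)

# Check the invariant: each group id appears in exactly one partition.
owner = {}
for i, part in enumerate(parts):
    for gid, _ in part:
        owner.setdefault(gid, set()).add(i)
assert all(len(partitions) == 1 for partitions in owner.values())
```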

yinshurman commented 5 years ago

@imatiach-msft Thanks very much! You are always so kind!

Based on my recent experience with LightGBMRanker on Spark, I found the error reporting quite frustrating. For example, it makes no sense to require that weightCol be Double. I used Long, and at runtime it raised `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Double`. I spent a lot of time tracking the error down, and finally found in TrainUtils that the weight column is read with something like `Row.getDouble`. (The same Long weight column works fine in XGBoost4J-Spark, so this surprised me a lot!)

Another thing is evaluation: I can't find any parameter to evaluate the fitted model, or to set a metric such as "MAP" or "NDCG". Because of the special flavor of ranking tasks, I had to write my own train/validation split function to make sure the sampling is random by group. I did find a validationIndicatorCol parameter, but it seems weird to hard-code which rows should or shouldn't be trained. As I understand it, the validation set should be randomly selected; a better parameter would be a fraction indicating how many samples to leave out of the training set as validation, perhaps along with a parameter for the random-selection strategy.

I am also curious about the algorithm LightGBMRanker uses. As far as I know, LambdaMART is the standard learning-to-rank algorithm here, but it has pairwise and listwise variants, so which one does LightGBMRanker use? In addition, the setObjective parameter's doc only says

    "regression_l2, regression_l1, huber, fair, poisson, quantile, mape, gamma or tweedie. " +
    "For classification applications, this can be: binary, multiclass, or multiclassova. ")

It says nothing about ranking! Could we have "rank:ndcg", "rank:pairwise", and "rank:map" as candidates?
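For what it's worth, the group-aware train/validation split I mentioned can be sketched in pure Python like this (the names are mine, not an mmlspark API): whole groups are shuffled and assigned to train or validation, so no query is torn across the two sets.

```python
import random

def group_train_valid_split(rows, group_of, valid_fraction=0.2, seed=42):
    """Hold out whole groups (queries) for validation, so that no
    group is split between the train and validation sets."""
    groups = sorted({group_of(r) for r in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # At least one group goes to validation.
    n_valid = max(1, int(len(groups) * valid_fraction))
    valid_groups = set(groups[:n_valid])
    train = [r for r in rows if group_of(r) not in valid_groups]
    valid = [r for r in rows if group_of(r) in valid_groups]
    return train, valid

rows = [("q1", 1), ("q1", 0), ("q2", 1), ("q2", 0), ("q3", 2), ("q3", 0)]
train, valid = group_train_valid_split(rows, group_of=lambda r: r[0])
# No query appears on both sides of the split.
assert not ({r[0] for r in train} & {r[0] for r in valid})
```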

imatiach-msft commented 5 years ago

@yinshurman
"it is nonsense to require the weightCol must be Double type": will look into fixing this.
"I don't find any parameter to evaluate the fitted model": this is fixed and should be in the next release: https://github.com/Azure/mmlspark/pull/672 MAP and NDCG should be available, see: https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst#metric
"I am curious about the algorithms the LGBMRanker use": see objective: https://github.com/microsoft/LightGBM/blob/master/docs/Parameters.rst#objective and the paper it links to: https://papers.nips.cc/paper/2971-learning-to-rank-with-nonsmooth-cost-functions.pdf
Hope this helps!
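For reference, in the native LightGBM parameter naming from those docs, a ranking setup looks roughly like this. The parameter names come from Parameters.rst (objective `lambdarank`, metrics `ndcg`/`map`, alias `ndcg_eval_at`); how they map onto the mmlspark setters in a given release is worth double-checking.

```python
# Native LightGBM parameter names (see Parameters.rst linked above).
# "lambdarank" is the ranking objective; "ndcg" and "map" are ranking metrics.
params = {
    "objective": "lambdarank",
    "metric": ["ndcg", "map"],
    "ndcg_eval_at": [5, 10],  # positions at which NDCG is evaluated
}
print(params["objective"])  # → lambdarank
```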

imatiach-msft commented 5 years ago

@yinshurman sent a PR to allow Int and Long types for weightCol: https://github.com/Azure/mmlspark/pull/688

yinshurman commented 5 years ago

@imatiach-msft That's pretty nice work!