microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

Training LightGBMRanker several times gives different NDCG on testing set #580

Open daureg opened 5 years ago

daureg commented 5 years ago

I noticed that when training on Databricks with the same parameters on the same data several times, the resulting models don't give the same predictions, as evidenced by different NDCG on a separate testing set. Here is my training function; my training set has 400K examples in 5K lists, with 60 features:

import com.microsoft.ml.spark.LightGBMRanker  // mmlspark package at the time of this issue

def train(): Unit = {
  val lgbm = new LightGBMRanker()
    .setCategoricalSlotIndexes(Array(0, 2, 3, 4, 6, 7, 8, 59))
    .setFeaturesCol("features")
    .setGroupCol("query_id")
    .setLabelCol("label")
    .setMaxPosition(10)
    .setParallelism("voting")
    .setNumIterations(15)
    .setMaxDepth(4)
    .setNumLeaves(12)
  val training = table("training")  // Databricks notebook helper, equivalent to spark.table
  val model = lgbm.fit(training)
}

Is that inherent to distributed training (on 5 executors) or should I change some parameters of my LightGBMRanker instance?

daniloascione commented 5 years ago

If the table is repartitioned to a single partition (table("training").repartition(1)), the results are consistent, but this means no parallelism.
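A minimal sketch of that workaround, reusing the lgbm ranker and training table from the snippet above:

// Hedged sketch: collapsing the data to one partition makes training
// deterministic, but serializes all work onto a single executor.
val singlePartition = table("training").repartition(1)
val model = lgbm.fit(singlePartition)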

imatiach-msft commented 5 years ago

@daureg thank you for reporting this issue. This looks similar to the issue here: https://github.com/Azure/mmlspark/issues/564. I will need to investigate this problem more to figure out the root cause of the randomness; I'm not sure whether it is fixable. It's on my todo list now, but not as high priority as https://github.com/Azure/mmlspark/issues/569 and https://github.com/Azure/mmlspark/issues/483. Does one model always give the same predictions, or is it only different models trained on the same data that differ?

daureg commented 5 years ago

Indeed, it's the same as #564 (unless there is something specific to the ranker, but most likely not). I will also try predicting several times with the same model, but for now it's different models trained on the same data that give different predictions.

daniloascione commented 5 years ago

@imatiach-msft maybe we need to ensure that each partition gets all the elements from the same group, and to enforce the group sorting by adding a sortWithinPartitions here: https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMBase.scala#L45 (similarly to https://github.com/Azure/mmlspark/blob/master/src/lightgbm/src/main/scala/LightGBMRanker.scala#L67).
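A user-side sketch of that suggestion (numPartitions is a hypothetical value; query_id and lgbm come from the original snippet): co-locate each query group in one partition and sort before calling fit.

import org.apache.spark.sql.functions.col

// Hedged sketch: hash-partition by the group column so each query group
// lands in exactly one partition, then sort so groups are contiguous.
val numPartitions = 5  // e.g. one per executor; an assumption, not a tuned value
val prepared = table("training")
  .repartition(numPartitions, col("query_id"))
  .sortWithinPartitions("query_id")
val model = lgbm.fit(prepared)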

imatiach-msft commented 5 years ago

@daniloascione yes, that was something I was going to add later. I'm not sure whether it should be a separate utility or done in the ranker itself (which may hurt performance significantly, since it would incur a shuffle across partitions). Note it wouldn't go into LightGBMBase, because that is the base class for the classifier and regressor as well, and this is needed only for the ranker. I sort in LightGBMRanker so that the groups are ordered, but I don't ensure that a group doesn't cross partitions; as you said, in the ranker case one group should live in only one partition. I'm not sure it is related to your specific issue, though: even if each group is confined to a single partition, you may still get different results from run to run, although the difference from model to model should at least be smaller.
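A diagnostic along these lines (a sketch, assuming the training table and query_id column from above) can check whether any group currently crosses a partition boundary:

import org.apache.spark.sql.functions._

// Count the query groups whose rows are spread over more than one
// partition; a nonzero count means the ranker sees split groups.
val splitGroups = table("training")
  .withColumn("pid", spark_partition_id())
  .groupBy("query_id")
  .agg(countDistinct("pid").as("nPartitions"))
  .filter(col("nPartitions") > 1)
  .count()
println(s"groups split across partitions: $splitGroups")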

imatiach-msft commented 5 years ago

@daniloascione @daureg just out of curiosity, how are you computing the NDCG? I would like to add an evaluator for LightGBMRanker, similar to the Spark ML evaluators and MLlib metrics. Is there one that already exists? I couldn't find anything in Spark ML.

daniloascione commented 5 years ago

@imatiach-msft I tried to add ranking metrics to Spark ML in the past (https://github.com/apache/spark/pull/16618 and https://issues.apache.org/jira/browse/SPARK-14409), but things got stuck for several reasons. Currently, we are using a UDF-based implementation of NDCG, similar to this one: http://lobotomys.blogspot.com/2016/08/normalised-discounted-cumulative-gain.html
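For reference, a sketch of such a UDF-based NDCG@10 (dcgAt and ndcgAt10 are hypothetical names; it assumes each query's relevance labels arrive already ordered by descending model score):

import org.apache.spark.sql.functions.udf

// DCG over the first k labels: sum of (2^rel - 1) / log2(position + 1),
// with positions counted from 1.
def dcgAt(k: Int)(labels: Seq[Double]): Double =
  labels.take(k).zipWithIndex.map { case (rel, i) =>
    (math.pow(2.0, rel) - 1.0) / (math.log(i + 2.0) / math.log(2.0))
  }.sum

// NDCG@10 = DCG of the predicted order / DCG of the ideal (label-sorted) order.
val ndcgAt10 = udf { labels: Seq[Double] =>
  val ideal = dcgAt(10)(labels.sortBy(-_))
  if (ideal == 0.0) 0.0 else dcgAt(10)(labels) / ideal
}

Applied per query (e.g. on an array column of labels collected in score order), averaging the per-query values gives the dataset-level NDCG.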

kbafna-antuit commented 4 years ago

@daniloascione @daureg I am facing a similar issue, where training the model on the same data with the same parameters results in different predictions each time. Did you find a fix for this?

daniloascione commented 4 years ago

@KeertiBafna No, I didn't find a fix, unfortunately. I haven't tried the idea of "sorting within partitions" yet (see above); maybe it is time to look at this.

kbafna-antuit commented 4 years ago

@daniloascione Can I use repartitioning by a key, as below? Say, for example, I repartition my data into 8 partitions and add a column 'key' with values from 0 to 7. Will the following line ensure that each partition holds the same key group, in the same order, every time?

df.repartition(8, 'key').sortWithinPartitions('order_col')

daniloascione commented 4 years ago

Yes, I think so; the partitions should stay sorted at least until the next operation that involves a shuffle. I recommend writing tests anyway.
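A sketch of such a test (assuming a numeric order_col, as in the comment above): read back each partition's contents and assert they are non-decreasing.

import org.apache.spark.sql.Encoders

// For each partition, check that order_col never decreases; collect one
// Boolean per partition and require them all to be true.
val allSorted = df
  .select("order_col")
  .mapPartitions { rows =>
    val vals = rows.map(_.getLong(0)).toSeq  // assumes order_col is a long
    Iterator.single(vals.isEmpty || vals.zip(vals.tail).forall { case (a, b) => a <= b })
  }(Encoders.scalaBoolean)
  .collect()
  .forall(identity)
assert(allSorted, "found a partition that is not sorted by order_col")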

daniloascione commented 4 years ago

@imatiach-msft is this issue solved in later versions? I believe you mentioned in another issue that you added a sortWithinPartitions call to preserve the sorting.