microsoft / NimbusML

Python machine learning package providing simple interoperability between ML.NET and scikit-learn components.

Lambda ranker does not throw an error when labels are strings as opposed to ordinal ints. #94

Open dataninjia opened 5 years ago

dataninjia commented 5 years ago

My team is currently using LightGBMrank through NimbusML for some ranking problems. However, we are a bit confused about the data type required for the label column; I couldn't find much documentation on this.

I tried a few iterations based on the default example given in the LightGBMrank documentation, which has ordinal labels. Here are the iterations I tried:

  1. The default, with ordinal labels
  2. Changed data input to a data frame to make sure output is the same. It is.
  3. Remapped labels to strings: {0: "Bad", 1: "Fair", 2: "Good", 3: "Excellent"}.
  4. Remapped the ordering and added a random label, "Goofy".

The NDCG results of these 4 runs are different, and none of them broke the classifier or raised an error.
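To illustrate why the string runs can score differently, here is a minimal pure-Python sketch. It does not call NimbusML, and it is an assumption on my part that string labels get converted to integer keys by some implicit rule; the two rules below (first appearance and alphabetical) are hypothetical stand-ins for whatever the library actually does. Either way, the implied relevance order no longer matches the intended one.

```python
# Intended relevance order, worst to best (taken from the remapping in iteration 3).
INTENDED = ["Bad", "Fair", "Good", "Excellent"]


def keys_by_first_appearance(labels):
    """Hypothetical rule: assign 0, 1, 2, ... in the order labels are first seen."""
    mapping = {}
    for label in labels:
        mapping.setdefault(label, len(mapping))
    return mapping


def keys_alphabetical(labels):
    """Hypothetical rule: assign 0, 1, 2, ... in sorted order of distinct labels."""
    return {label: i for i, label in enumerate(sorted(set(labels)))}


# A made-up label column, only for illustration.
labels = ["Good", "Bad", "Excellent", "Fair", "Bad", "Good"]

intended = {label: i for i, label in enumerate(INTENDED)}
first_seen = keys_by_first_appearance(labels)
alphabetical = keys_alphabetical(labels)

print("intended:    ", intended)
print("first seen:  ", first_seen)
print("alphabetical:", alphabetical)
```

Under the first-appearance rule "Good" becomes the lowest grade for this column, and under the alphabetical rule "Excellent" ranks below "Fair". A ranker trained on either set of keys would optimize a different ordering than the one intended.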

The ipython notebook attached has code to reproduce the issue.

lambdaRankTest.zip

Thanks, Mike

TomFinley commented 5 years ago

My own opinion on the NimbusML effort is that this is a case where the effort to be "helpful" in the API has backfired. While in, say, multiclass classification the order in which classes are assigned is unimportant, in the case of ranking it is really important. My own first thought is that ranking should desist from trying to "help" in this manner. Instead, if someone feeds in an inappropriate type (like a string), it should offer some suitably prescriptive advice on how to map it to an appropriate type, rather than trying to "guess," which will almost certainly result in undesirable consequences.
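In the meantime, a user-side workaround along these lines might look like the sketch below: map the string grades to ordinal integers explicitly before the label column reaches the ranker, and fail loudly on anything unexpected. The `RELEVANCE` table and the `to_ordinal` helper are hypothetical names for illustration, not part of NimbusML.

```python
# Explicit, user-chosen relevance grades: higher means more relevant.
# (Hypothetical table; adjust to the grades in your own data.)
RELEVANCE = {"Bad": 0, "Fair": 1, "Good": 2, "Excellent": 3}


def to_ordinal(labels, mapping=RELEVANCE):
    """Map string labels to ordinal ints, rejecting labels not in the mapping."""
    unknown = sorted({label for label in labels if label not in mapping}, key=repr)
    if unknown:
        raise ValueError(
            f"Unknown label(s) {unknown}; expected one of {list(mapping)}. "
            "Map every label to an ordinal integer before ranking."
        )
    return [mapping[label] for label in labels]


print(to_ordinal(["Good", "Bad", "Excellent"]))

try:
    to_ordinal(["Good", "Goofy"])
except ValueError as err:
    print("rejected:", err)
```

With this in place, a stray label such as "Goofy" stops the run with an actionable message instead of silently receiving an arbitrary rank.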