microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Support for non integer labels for the ranking task #5423

Closed prashantbudania closed 1 year ago

prashantbudania commented 2 years ago

I am working on a ranking task where the labels are a mix of integers [3,2,1] and floats [0.56, 0.34, ...]. I am unable to use the dataset in its current format as Lightgbm doesn't support non-integer labels for the ranking task.

This is the error I am getting: LightGBMError: label should be int type (met 0.560000) for ranking task, for the gain of label, please set the label_gain parameter

If the issue is that only integer labels can be supported with custom label_gain values, you can add support for non-integer labels where the label_gain is fixed as 2^label - 1.
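As a rough sketch (not LightGBM source code), the gain function mentioned in the error message and in this proposal is just 2^label - 1, which is well defined for floats as well as integers:

```python
# Illustrative only: the default DCG gain LightGBM uses for an integer
# relevance label, and what the proposal would evaluate for float labels too.
def label_gain(label: float) -> float:
    # 2^label - 1; for integer labels this gives 0, 1, 3, 7, ...
    return 2.0 ** label - 1.0

print(label_gain(3))     # 7.0
print(label_gain(0.56))  # roughly 0.474
```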

jameslamb commented 2 years ago

Thanks for using LightGBM.

Can you tell us what the real-world meaning of your labels is? For example, what is the difference between y=0.34 and y=3?

Depending on your answer to that question, a regression objective might be effective for the task you're working on.

prashantbudania commented 2 years ago

I am working on a ranking objective where the goal is to show the most relevant documents given a query ranked by a 'score'. And for the training data, I am using click-data where positive user interactions would lead to a label of 3/2/1. But there are also a lot of documents that are 'unseen' by users (documents beyond the last clicked document) and instead of assigning them a label of 0, I am assigning them labels between 0 and 1.

The hypothesis is that the original ranker used to produce and collect these click signals is good i.e. its ordering is good. So, I am providing that information to the ltr model in the form of labels on the 0-1 scale. I want the model to learn what makes document A have a score of 0.9 vs document B have a score of 0.3. Setting all these to 0 would throw away this information and I have a really big set of unseen documents.

And the task is still very much a ranking task - the exact label scores don't matter that much; only the ordering matters. And because the labeling scale is not a standard one, and because some of the gold scores represent a normalized score between 0 and 1, I don't think the regression objective would help here - the scale and the logic are based on intuition.

Let me know if this was enough of an explanation of what I am trying to do. If not, I can provide more information.

jameslamb commented 2 years ago

It's hard for me to reconcile these two statements:

But I'm not that knowledgeable about ranking problems, so I'll defer to others like @shiyu1994 and @guolinke to hopefully provide some guidance for you.

prashantbudania commented 2 years ago

Sorry for the confusion - I think I didn't word it properly.

What I meant to say was that I want the model to learn that document A is more relevant (because our current ranker output a higher score for it). The exact numbers (0.9 or 0.3) don't matter that much - I am just using a normalized score to tell the ltr model that document A is in fact more relevant than document B.

jameslamb commented 2 years ago

I am just using a normalized score to tell the ltr model that document A is in fact more relevant than document B

If that's the case, then couldn't you recode all of them as integers? That doesn't have to imply "setting all these to 0".

For example, like this:

That would preserve the relative ordering and is a structure that any learning-to-rank approach accepting integers for ranking could understand.

And you could divide the space from 0.0 to 1.0 into as many bins as you want.
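As a concrete sketch of that recoding (bin edges and values purely illustrative): discretize the float labels in [0, 1) into bins, and shift the original integer labels 1/2/3 above them so the overall ordering is preserved.

```python
import bisect

# Illustrative bin edges for labels in [0, 1): four bins, 0..3.
BIN_EDGES = [0.25, 0.5, 0.75]

def recode(label: float) -> int:
    """Map a mixed float/int label to an order-preserving integer label."""
    if label < 1.0:
        # float "unseen document" scores land in bins 0..3
        return bisect.bisect_right(BIN_EDGES, label)
    # original integer labels 1/2/3 become 4/5/6, above all binned floats
    return 3 + int(label)

print([recode(v) for v in (0.1, 0.34, 0.56, 0.9, 1, 2, 3)])
# -> [0, 1, 2, 3, 4, 5, 6]
```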

prashantbudania commented 2 years ago

Yeah, that's one possibility, but I am worried that there could be a negative impact from training with a wide integer scale 0/1/2/3/4/... I am using lambdamart models for this task (i.e. mart with the lambdarank loss), and the lambdarank loss depends on the delta in the nDCG scores produced by a swap.

Let's say I am swapping documents at index 0 and 2 with labels 0.1 and 3. If I am using a wider scale (to enable integer coding for the floating point labels), I would be swapping documents with labels 0 and 4, producing a higher delta/loss.
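To make that concern concrete, here is a small worked example (values illustrative) comparing the DCG delta of a swap under the original float labels versus a widened integer recoding, using the standard 2^label - 1 gain and log2(i + 2) discount:

```python
from math import log2

def dcg(labels):
    # DCG with gain 2^label - 1 and discount log2(i + 2) for position i
    return sum((2 ** l - 1) / log2(i + 2) for i, l in enumerate(labels))

# Swap the first and last documents in a 3-document list.
# Original float labels vs. a widened 0..4 integer recoding (0.1 -> 0, 3 -> 4).
orig = abs(dcg([0.1, 1, 3]) - dcg([3, 1, 0.1]))
wide = abs(dcg([0, 1, 4]) - dcg([4, 1, 0]))
print(orig, wide)  # the same swap produces a larger delta on the wider scale
```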

By the way, I modified the dcg_calculator.cpp file to enable support for non-integer labels and built it from the source. Right now, I am getting an nDCG score of 1 all the time so I guess I misunderstood something in your implementation (also my knowledge of C++ is extremely limited) but working on fixing it - should be done in the next hour or so.

prashantbudania commented 2 years ago

Yeah, it works now - my local compiled copy now has support for non-integer labels.

prashantbudania commented 2 years ago

[screenshot of the code diff] With this, I do lose support for any arbitrary user-defined label_gain function, since I am hardcoding the label gain to 2^label - 1, but that works for me as I am only planning on using this function.

For generalizability (i.e. extending support for user-defined label gains), I could add a new transformation function: by default it would transform x to 2^x - 1, but if a user provided their own label_gain, it would apply no transformation and return x unchanged.

jameslamb commented 1 year ago

@prashantbudania sorry for the delayed response.

It's difficult for me to tell, from a screenshot of a file diff, what exactly you're proposing and the impact it'll have on LightGBM. A pull request with the proposed code changes would be clearer.

I'm also still not understanding, from your description of the problem you're working on, why adding support for non-integer labels is necessary. It seems to me that LambdaRank with integer-coding like I suggested in https://github.com/microsoft/LightGBM/issues/5423#issuecomment-1222802535 + some modifications of label_gain should be enough. But very possible that my confusion is just from my lack of experience with learning-to-rank.

@ffineis if you have time, could you give us your opinion on this request?

jameslamb commented 1 year ago

@prashantbudania I should have also said...if you've used other libraries that support ranking on labels like the ones you've described, links to documentation and examples would be appreciated.

ffineis commented 1 year ago

Hello!

From my understanding of the request, @prashantbudania is asking for "ranking labels" to be continuously valued - non-integer labels for the ranking task. The ranking label is referred to as $rel_i$ in the DCG formula $$DCG = \sum_{i=0}^{n_g - 1}\frac{2^{rel_i} - 1}{\log_2(i + 2)}$$

Each label's "gain" is the numerator value: $2^{rel_i} - 1$.

To me, there are several issues with allowing "ranking labels" to be floats, because we'd be allowing labels to be continuously-valued.

  1. Labels are inherently discrete values: cat/dog, 1, 2, 3, 4, 5 stars, etc.
  2. Allowing labels to be continuous would break the meaning and interpretation of the label_gain parameter, which is a map from ranking labels to their corresponding gain values. When no label_gain is provided, the DefaultLabelGain method makes a default vector holding the values [0, 1, 3, 7, ..., 2^31 - 1] (thus allowing up to 32 ranking labels by default). When label_gain is provided, the numerator values in the DCG formula are swapped out with those in the provided label_gain. If labels are truly continuous, then users can't create discrete label_gain sets.
  3. Basically all of the LETOR literature and tools I've ever seen assume discrete ranking labels. I'm not disputing that this is for a good reason, rather than just letting users provide the gain as a vector of floats! Just saying that needing an uncountable number of ranking labels doesn't seem like a common use case. Using floats as "labels" in xgboost with the rank:ndcg objective gives me very weird results, and I wouldn't guess that floats-as-labels would work well with other tools like SVM-Rank or RankLib.
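For reference, the default label_gain vector described in point 2 can be sketched as a one-liner (this mirrors the values it holds, not LightGBM's C++ implementation):

```python
# Default gain 2^i - 1 for integer labels i = 0..31 (32 labels).
default_label_gain = [2 ** i - 1 for i in range(32)]
print(default_label_gain[:5], default_label_gain[-1])
# [0, 1, 3, 7, 15] 2147483647
```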

What @jameslamb is proposing above is the solution to what the requester is asking IMO. @prashantbudania there shouldn't be anything preventing you from doing the following:

  1. Either finely discretize the space [0, 1] into bins, or even use a 1-1 label-to-gain mapping, allowing you to make a "label" for 0.571 and another for 0.572 (whose gain could be either $2^{0.571} - 1$ or just 0.571, whichever you intend).
  2. Replace the ranking labels in the train/val/test sets with the integers 0, 1, ..., |label_gain| - 1 so that the labels conform to discrete ranking labels.
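Putting those two steps together (all values illustrative): after recoding the float labels to integers 0..N-1, you can pass a label_gain vector whose i-th entry is the gain you want for recoded label i - for example, 2^v - 1 for the original value v that label i stands for:

```python
# Original label values, sorted; each becomes integer label 0, 1, ..., N-1.
sorted_values = [0.1, 0.34, 0.56, 0.9, 1.0, 2.0, 3.0]

# Gain for recoded label i = 2^v - 1 for the original value v it represents.
label_gain = [2 ** v - 1 for v in sorted_values]

# Sketch of lambdarank params using LightGBM's label_gain parameter.
params = {
    "objective": "lambdarank",
    "label_gain": label_gain,
}
print(label_gain)
```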

Perhaps a more generalizable solution would be to allow users to provide a custom ranking fobj and feval - I don't think this is possible today? The lambdarank objective is well defined, but if users could define a modified lambdarank gradient, they could compute $\frac{2^{rel_i} - 1}{\log_2(i + 2)}$ (or anything else) using their own continuously-valued rel vector instead of labels. Then they wouldn't even need to define a custom label_gain vector.

jameslamb commented 1 year ago

🤩 Thanks so much @ffineis ! I really really appreciate you sharing your expertise here, and the references to other projects like XGBoost, and RankLib.

allow users to provide a custom ranking fobj and feval - I don't think this is possible today

I think it should be possible.

In lightgbm.train(), you can provide a custom objective function that will be passed the Dataset object at each training iteration.

https://github.com/microsoft/LightGBM/blob/61e464bc02bd325672ce71daabfa57d569cf02bc/python-package/lightgbm/engine.py#L105-L119

It should be possible to construct a Dataset with continuous labels + the group for grouping queries, and then that objective function could access Dataset.get_group() to get back the query definitions needed to compute gradients and hessians for a ranking objective.

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=1_000)

dtrain = lgb.Dataset(data=X, label=y, group=[500, 500])  # group sizes must sum to n_samples

def _custom_objective(preds: np.ndarray, dtrain: lgb.Dataset):
    rel = dtrain.get_label()
    group = dtrain.get_group()
    # ... the hard part where you figure out the gradient and hessian ...
    return grad, hess  # one entry per row of the Dataset

Because the objective function is able to access the group and see the entire Dataset (instead of being evaluated pointwise), it should be possible to do something like ranking that requires information from multiple related rows.
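As a sketch of what "the hard part" might look like - using a simple RankNet-style pairwise logistic gradient rather than LightGBM's actual lambdarank internals, with illustrative names and values - a group-aware gradient that happily accepts float relevances could be:

```python
from math import exp

def ranknet_grad(preds, rel, group_sizes):
    """Illustrative RankNet-style pairwise gradient (not LightGBM's lambdarank).

    For each within-group pair (i, j) with rel[i] > rel[j], push preds[i]
    up and preds[j] down; the hessian uses the logistic curvature.
    Float relevances work fine because only the ordering is compared."""
    grad = [0.0] * len(preds)
    hess = [0.0] * len(preds)
    start = 0
    for size in group_sizes:  # pairs are only formed within a query group
        for i in range(start, start + size):
            for j in range(start, start + size):
                if rel[i] > rel[j]:
                    s = 1.0 / (1.0 + exp(preds[i] - preds[j]))
                    grad[i] -= s
                    grad[j] += s
                    hess[i] += s * (1.0 - s)
                    hess[j] += s * (1.0 - s)
        start += size
    return grad, hess

# Two queries of two docs each; continuous relevances, all preds equal.
g, h = ranknet_grad([0.0, 0.0, 0.0, 0.0], [0.9, 0.3, 0.56, 0.1], [2, 2])
print(g)  # [-0.5, 0.5, -0.5, 0.5]
```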

And in that situation, where you provide a custom objective function to lightgbm.train(), LightGBM doesn't try to figure out whether you're doing regression, classification, or ranking. It's just performing boosting one iteration at a time, based on whatever gradients your objective function returns. So it won't, for example, say "hey it looks like you're doing ranking, you can't have non-integer labels".

That's generally true throughout LightGBM. For example, predict_proba() over in lightgbm.sklearn.LGBMClassifier falls back to just returning raw predictions when you provide a custom objective function.

https://github.com/microsoft/LightGBM/blob/61e464bc02bd325672ce71daabfa57d569cf02bc/python-package/lightgbm/sklearn.py#L1143-L1147


I'm going to tag this issue awaiting response so that it'll automatically be closed if we don't hear more from @prashantbudania in the next 30 days. I think the original question has been thoroughly answered.

@ffineis if you ever want to pursue this idea of a custom objective function for ranking with continuous labels, we'd welcome a contribution to https://github.com/microsoft/LightGBM/tree/master/examples/python-guide!

github-actions[bot] commented 1 year ago

This issue has been automatically closed because it has been awaiting a response for too long. When you have time to work with the maintainers to resolve this issue, please post a new comment and it will be re-opened. If the issue has been locked for editing by the time you return to it, please open a new issue and reference this one. Thank you for taking the time to improve LightGBM!

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.