microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
MIT License

Tuning LambdaRank Objective with Sklearn's GroupKFold (RandomSearch) #5660

Open amir-rahnama opened 1 year ago

amir-rahnama commented 1 year ago


With the lambdarank objective, I cannot get scikit-learn's GroupKFold to work together with RandomizedSearchCV. Is there a way to make this work? Below is a simple example.

Reproducible example

import numpy as np 
import pandas as pd 
import os 
from sklearn.model_selection import RandomizedSearchCV, GroupKFold
from sklearn.metrics import make_scorer, ndcg_score
import lightgbm

X = [[0.  , 0.  , 0.01],
    [1.  , 0.  , 1.  ],
    [0.43, 0.  , 0.  ],
    [0.43, 0.  , 0.4 ],
    [0.  , 0.  , 0.01],
    [0.  , 0.  , 0.31],
    [0.  , 0.  , 1.  ],
    [0.  , 0.  , 0.  ],
    [0.  , 0.  , 0.15]]

y = [0, 0, 1, 0, 3, 0, 4, 0, 0]
groups = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1]).astype(int)  
flat_group = [7, 2]

## Training on the data works
gbm = lightgbm.LGBMRanker(objective='lambdarank')
gbm.fit(X, y=y, group=flat_group)

### Random hyperparameter tuning doesn't work
hyper_params = {
    'n_estimators': [10, 20, 30, 40],
    'num_leaves': [20, 50, 100, 200],
    'max_depth': [5,10,15,20],
    'learning_rate': [0.01, 0.02, 0.03]
}

gkf = GroupKFold(n_splits=2)
folds = gkf.split(X, groups=groups)

grid = RandomizedSearchCV(gbm, hyper_params, n_iter=2,
        cv=folds, verbose=3, scoring=make_scorer(ndcg_score))

def group_gen(groups, folds):
    for train, _ in folds:
        yield np.unique(groups[train], return_counts=True)[1]

gen = group_gen(groups, folds)
grid.fit(X, y, group=next(gen))

Which produces the following error:

StopIteration                             Traceback (most recent call last)
Cell In[102], line 1
----> 1 grid.fit(X, y, group=next(gen))
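
A possible reading of this traceback, assuming the cell was run more than once or the fold generator was otherwise already consumed: the object returned by `gkf.split(...)` is a one-shot generator, and both `group_gen` and the search iterate over the same object. Once it is drained, `next(gen)` raises `StopIteration`. The sketch below reproduces that behaviour with the standard library only; `fake_splits` is a hypothetical stand-in for `GroupKFold.split`, not the real function.

```python
def fake_splits():
    """Hypothetical stand-in for gkf.split(...): yields (train, test) pairs once."""
    yield [7, 8], [0, 1, 2, 3, 4, 5, 6]
    yield [0, 1, 2, 3, 4, 5, 6], [7, 8]


def group_gen(folds):
    # Simplified version of the helper in the issue: one value per training fold.
    for train, _ in folds:
        yield len(train)


folds = fake_splits()
gen = group_gen(folds)

first = next(gen)        # pulls the first split out of `folds`
remaining = list(folds)  # any other consumer of `folds` drains what is left

try:
    next(gen)            # `folds` is now empty, so the helper has nothing to yield
    exhausted = False
except StopIteration:
    exhausted = True

print(first, len(remaining), exhausted)
```

Materialising the splits with `list(...)` before sharing them avoids the exhaustion, although it does not by itself give each fold the right group sizes.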


Environment info

Sklearn: 1.1.3 LightGBM: 3.2.1

Additional Comments

The code I pasted is inspired by a previously suggested solution. However, it does not work in our case.

Any help would be appreciated.

replacementAI commented 1 year ago

Unfortunately, I don't think sklearn supports ranking estimators, though I could be wrong.

amir-rahnama commented 1 year ago

@replacementAI Thank you for your feedback. Is there a way to tune the parameters of LightGBM in cross-validation when it comes to ranking models? I tried optuna.integration.lightgbm.LightGBMTuner but it also doesn't work for ranking scenarios.