microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Tuning LambdaRank Objective with Sklearn's GroupKFold (RandomSearch) #5660

Open amir-rahnama opened 1 year ago

amir-rahnama commented 1 year ago

Description

For the lambdarank objective, scikit-learn's GroupKFold does not work together with RandomizedSearchCV. Is there a way to make this combination work? Below is a simple example.

Reproducible example

import numpy as np
from sklearn.model_selection import RandomizedSearchCV, GroupKFold
from sklearn.metrics import make_scorer, ndcg_score
import lightgbm

X = [[0.  , 0.  , 0.01],
    [1.  , 0.  , 1.  ],
    [0.43, 0.  , 0.  ],
    [0.43, 0.  , 0.4 ],
    [0.  , 0.  , 0.01],
    [0.  , 0.  , 0.31],
    [0.  , 0.  , 1.  ],
    [0.  , 0.  , 0.  ],
    [0.  , 0.  , 0.15]]

y = [0, 0, 1, 0, 3, 0, 4, 0, 0]
groups = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1]).astype(int)  
flat_group = [7, 2]

## Training on the data works
gbm = lightgbm.LGBMRanker(objective='lambdarank')
gbm.fit(X=X, y=y, group=flat_group)

## Random hyperparameter tuning doesn't work
hyper_params = {
    'n_estimators': [10, 20, 30, 40],
    'num_leaves': [20, 50, 100, 200],
    'max_depth': [5, 10, 15, 20],
    'learning_rate': [0.01, 0.02, 0.03]
}

gkf = GroupKFold(n_splits=2)
folds = gkf.split(X, groups=groups)

grid = RandomizedSearchCV(gbm, hyper_params, n_iter=2,
                          cv=folds, verbose=3,
                          scoring=make_scorer(ndcg_score),
                          error_score='raise')

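# For each fold, yield the per-query group sizes of the training rows
# (the `group` array that LGBMRanker.fit expects)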
def group_gen(groups, folds):
    for train, _ in folds:
        yield np.unique(groups[train], return_counts=True)[1]

gen = group_gen(groups, folds)

grid.fit(X, y, group=next(gen))

Which produces the following error:

---------------------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
Cell In[102], line 1
----> 1 grid.fit(X, y, group=next(gen))

StopIteration: 
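Note that `folds` is the single-use generator returned by `gkf.split()`: both `RandomizedSearchCV` (via `cv=folds`) and `group_gen` draw from it, so it is exhausted after one pass and a re-run of `next(gen)` raises `StopIteration`. Materializing the splits into a list avoids that particular error; a minimal sketch, using the same variables as above:

# Materialize the splits: a list can be iterated by both consumers.
folds = list(gkf.split(X, groups=groups))
gen = group_gen(groups, folds)

grid = RandomizedSearchCV(gbm, hyper_params, n_iter=2,
                          cv=folds, verbose=3,
                          scoring=make_scorer(ndcg_score),
                          error_score='raise')

# No more StopIteration, but `group=next(gen)` still describes only the
# first fold: scikit-learn forwards fit parameters that are not aligned
# with the samples unchanged to every fold's fit() call.
grid.fit(X, y, group=next(gen))

So even with the list, only one fold can ever receive the correct group sizes, which seems to be the fundamental mismatch between `*SearchCV` and the `group` parameter.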

Environment info

scikit-learn: 1.1.3
LightGBM: 3.2.1

Additional Comments

The code I pasted is inspired by the solution given in https://github.com/microsoft/LightGBM/issues/1137, which refers to https://github.com/Microsoft/LightGBM/blob/4df7b21dcf2ca173a812f9667e30a21ef827104e/python-package/lightgbm/engine.py#L267-L274. However, that approach does not work in our case.

Any help would be appreciated.

replacementAI commented 1 year ago

Unfortunately, I don't think scikit-learn supports ranking estimators, though I could be wrong.

amir-rahnama commented 1 year ago

@replacementAI Thank you for your feedback. Is there a way to tune LightGBM's parameters with cross-validation for ranking models? I tried optuna.integration.lightgbm.LightGBMTuner, but it does not work for ranking scenarios either.
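For reference, the closest workaround I can think of is a hand-rolled search loop that recomputes `group` for every fold, so each training subset gets matching query sizes. A rough, untested sketch, reusing `X`, `y`, `groups`, and `hyper_params` from the example above, with `ParameterSampler` standing in for `RandomizedSearchCV`:

import numpy as np
from sklearn.model_selection import GroupKFold, ParameterSampler
from sklearn.metrics import ndcg_score
import lightgbm

# Hand-rolled randomized search: recompute `group` for every fold,
# since *SearchCV cannot re-slice a per-query `group` array per split.
X_arr, y_arr = np.asarray(X), np.asarray(y)
folds = list(GroupKFold(n_splits=2).split(X_arr, groups=groups))

best_score, best_params = -np.inf, None
for params in ParameterSampler(hyper_params, n_iter=2, random_state=0):
    fold_means = []
    for train_idx, test_idx in folds:
        # Per-query sizes of the training subset (np.unique's counts match
        # the data order here because each query's rows are contiguous).
        train_group = np.unique(groups[train_idx], return_counts=True)[1]
        model = lightgbm.LGBMRanker(objective='lambdarank', **params)
        model.fit(X_arr[train_idx], y_arr[train_idx], group=train_group)
        preds = model.predict(X_arr[test_idx])
        # NDCG per held-out query, then averaged.
        per_query = [
            ndcg_score([y_arr[test_idx][groups[test_idx] == g]],
                       [preds[groups[test_idx] == g]])
            for g in np.unique(groups[test_idx])
        ]
        fold_means.append(np.mean(per_query))
    if np.mean(fold_means) > best_score:
        best_score, best_params = np.mean(fold_means), params

print(best_params, best_score)

This sidesteps `*SearchCV` entirely, at the cost of losing its parallelism and bookkeeping, so I would still prefer a proper integration if one exists.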