microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Custom LambdaRank NDCG not matching built-in code #5735

Open MhmdSaiid opened 1 year ago

MhmdSaiid commented 1 year ago

Hello, I am trying to implement the LambdaRank objective in Python so that I can later swap the NDCG metric for one of my own. I started by implementing it with NDCG so that I could compare against the built-in implementation. Unfortunately, the NDCG@1 metric does not change from one iteration to the next. Could someone please take a look? FYI, I used the equations for the gradients and hessians from the paper.
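For reference, the per-pair quantities I am trying to compute can be sketched like this (a minimal sketch, assuming sigma = 1 and that `delta_ndcg` is the absolute NDCG change from swapping the pair; `pair_lambda` is a hypothetical helper, not code from my implementation):

```python
import numpy as np

# Sketch of the per-pair LambdaRank contributions for one pair (i, j)
# where item i is more relevant than item j, with scores s_i and s_j.
def pair_lambda(s_i, s_j, delta_ndcg):
    # rho = 1 / (1 + exp(s_i - s_j))
    rho = 1.0 / (1.0 + np.exp(s_i - s_j))
    lam = delta_ndcg * rho                  # gradient contribution
    hess = delta_ndcg * rho * (1.0 - rho)   # hessian contribution
    return lam, hess

lam, hess = pair_lambda(s_i=0.2, s_j=-0.1, delta_ndcg=0.5)
```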

Here is the custom objective function (the positions and qids are passed in because I need them to compute NDCG):

import numpy as np
import pandas as pd

def custom_lambdarank(preds, train_data, positions, qids):

    df = train_data.get_data().copy()
    df['relevance'] = train_data.get_label()

    df['position'] = positions
    df['search_id'] = qids

    # per-document discount and gain (NDCG components)
    df['discount'] = 1 / np.log2(2 + df['position'])
    df['gain'] = 2**df['relevance'] - 1

    df['pred'] = preds

    # self-join within each query to enumerate all document pairs
    swaps = df.merge(df, on='search_id', how='outer')

    # |delta NDCG| of swapping the pair, and rho = 1 / (1 + exp(s_x - s_y))
    swaps['delta'] = np.abs((swaps['discount_x'] - swaps['discount_y']) * (swaps['gain_x'] - swaps['gain_y']))
    swaps['rho'] = 1 / (1 + np.exp(swaps['pred_x'] - swaps['pred_y']))

    # initialize lambda and hessian
    swaps['lambda'] = 0.0
    swaps['hessian'] = 0.0

    # only pairs where x is more relevant than y contribute
    x_better = swaps['relevance_x'] > swaps['relevance_y']
    swaps.loc[x_better, 'lambda'] = swaps.loc[x_better, 'delta'] * swaps.loc[x_better, 'rho']
    swaps.loc[x_better, 'hessian'] = swaps.loc[x_better, 'delta'] * swaps.loc[x_better, 'rho'] * (1 - swaps.loc[x_better, 'rho'])

    # sum pair contributions per document; rename both index levels to
    # ('search_id', 'position') so the two Series align when subtracted
    lambdas_x = swaps.groupby(['search_id', 'position_x'])['lambda'].sum().rename_axis(['search_id', 'position'])
    lambdas_y = swaps.groupby(['search_id', 'position_y'])['lambda'].sum().rename_axis(['search_id', 'position'])
    lambdas = (lambdas_x - lambdas_y).rename('lambda')

    hessian_x = swaps.groupby(['search_id', 'position_x'])['hessian'].sum().rename_axis(['search_id', 'position'])
    hessian_y = swaps.groupby(['search_id', 'position_y'])['hessian'].sum().rename_axis(['search_id', 'position'])
    hessians = (hessian_x - hessian_y).rename('hessian')

    lhs = pd.concat([lambdas, hessians], axis=1).reset_index()

    # map the per-document sums back onto the original row order
    df = df.merge(lhs, on=['search_id', 'position'], how='left')

    grads = df['lambda'].to_numpy()
    hessians = df['hessian'].to_numpy()
    return grads, hessians
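The trickiest part for me was mapping the grouped pair sums back onto rows. A minimal, self-contained sketch of just that step (toy data, not my real dataset: one query with two documents and a single contributing pair):

```python
import numpy as np
import pandas as pd

# Toy pair table for one query: only the (rel_x > rel_y) pair
# carries a nonzero lambda contribution.
swaps = pd.DataFrame({
    'search_id':  [1, 1, 1, 1],
    'position_x': [0, 0, 1, 1],
    'position_y': [0, 1, 0, 1],
    'lambda':     [0.0, 0.3, 0.0, 0.0],
})

# Sum contributions received as "x" and as "y"; rename the index
# levels to a common ('search_id', 'position') so subtraction aligns.
lx = swaps.groupby(['search_id', 'position_x'])['lambda'].sum() \
          .rename_axis(['search_id', 'position'])
ly = swaps.groupby(['search_id', 'position_y'])['lambda'].sum() \
          .rename_axis(['search_id', 'position'])
lambdas = (lx - ly).rename('lambda')

# Merge the per-document sums back onto the original rows.
df = pd.DataFrame({'search_id': [1, 1], 'position': [0, 1]})
df = df.merge(lambdas.reset_index(), on=['search_id', 'position'], how='left')
```

The resulting per-document lambdas are antisymmetric here (the more relevant document gets +0.3, the other -0.3), which is what the pairwise difference should produce.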

This is how I am calling it:

best_params = {'n_estimators': 100,
               'learning_rate': 0.5,
               'num_leaves': 10,
               'max_depth': 10,
               'boosting_type': 'gbdt',
               'random_state': 42,
               'n_jobs': -1,
               'force_row_wise': True}

from functools import partial

ranker = lgb.train(params=best_params,
                   train_set=train_data,
                   valid_sets=[eval_data],
                   valid_names=['validation'],
                   fobj=partial(custom_lambdarank, positions=pos, qids=qids),
                   feval=partial(ndcg_at_1, qids=eval_qids),
                   num_boost_round=best_params['n_estimators']
                  )

ndcg_at_1() is my implementation of NDCG@1; it gives results similar to the built-in metric.
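For completeness, a hedged sketch of what such an NDCG@1 feval might look like (this is not the poster's code; the signature follows the lgb.train feval contract of returning (name, value, is_higher_better), and `qids` is assumed to give one query id per eval row):

```python
import numpy as np

def ndcg_at_1(preds, eval_data, qids):
    # For each query, gain of the top-scored document divided by
    # the gain of the truly best document, averaged over queries.
    labels = eval_data.get_label()
    qids = np.asarray(qids)
    scores = []
    for q in np.unique(qids):
        mask = qids == q
        rel, s = labels[mask], preds[mask]
        ideal = 2.0 ** rel.max() - 1.0
        if ideal == 0:          # skip queries with no relevant docs
            continue
        top = 2.0 ** rel[np.argmax(s)] - 1.0
        scores.append(top / ideal)
    return 'ndcg@1', float(np.mean(scores)), True
```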

Thanks in advance.

rakshitsakhuja commented 1 year ago

Following this

rakshitsakhuja commented 1 year ago

I tried something similar. I was actually doing the following steps:

1. Trained a LambdaRank model with the NDCG metric.
2. Ran model predict on the test data.
3. Reranked the results and applied the standard NDCG formula to score the test data, but the results were different.

Is there any code available to apply custom metrics after model predict?

P.S. I know model.eval() is available, but I want to check through my custom metrics as well.
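One way to do this post-predict (a minimal sketch, not from LightGBM itself; `preds`, `labels`, and `qids` are assumed to be per-row arrays for the test set) is to sort each query by predicted score and apply the standard NDCG@k formula directly:

```python
import numpy as np

def ndcg_at_k(preds, labels, qids, k=10):
    # Standard NDCG@k computed from raw predictions, grouped by query.
    preds, labels, qids = map(np.asarray, (preds, labels, qids))
    scores = []
    for q in np.unique(qids):
        m = qids == q
        rel = labels[m][np.argsort(-preds[m])][:k]      # predicted order
        ideal = np.sort(labels[m])[::-1][:k]            # ideal order
        disc = 1.0 / np.log2(np.arange(2, len(rel) + 2))
        dcg = ((2.0 ** rel - 1.0) * disc).sum()
        idcg = ((2.0 ** ideal - 1.0) * disc).sum()
        if idcg > 0:
            scores.append(dcg / idcg)
    return float(np.mean(scores))
```

This can be run on the output of model.predict() for the test set and compared against whatever custom metric you want to check.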