microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

`is_unbalance` does not work properly for evaluation #3687

Closed kevinorjohn closed 3 years ago

kevinorjohn commented 3 years ago

How are you using LightGBM?

LightGBM component: Python package

Environment info

Operating System: macOS Catalina 10.15.6

CPU/GPU model: CPU

Python version: 3.7.0

LightGBM version or commit hash: 3.1.1

Error message and / or logs

Train data, positive: 349, negative: 3151, positive ratio: 0.10
Test data, positive: 151, negative: 1349, positive ratio: 0.10
[LightGBM] [Info] Number of positive: 349, number of negative: 3151
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000737 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 3500, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.099714 -> initscore=-2.200403
[LightGBM] [Info] Start training from score -2.200403
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] train's binary_logloss: 0.232839    test's binary_logloss: 0.236435
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] train's binary_logloss: 0.19447 test's binary_logloss: 0.199018
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] train's binary_logloss: 0.167684    test's binary_logloss: 0.172805
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] train's binary_logloss: 0.147018    test's binary_logloss: 0.152877
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] train's binary_logloss: 0.129909    test's binary_logloss: 0.13641
[6] train's binary_logloss: 0.116018    test's binary_logloss: 0.122999
[7] train's binary_logloss: 0.104074    test's binary_logloss: 0.111615
[8] train's binary_logloss: 0.0938035   test's binary_logloss: 0.10199
[9] train's binary_logloss: 0.0849435   test's binary_logloss: 0.0936671
[10]    train's binary_logloss: 0.0765262   test's binary_logloss: 0.0858548
[Without weight] train binary logloss: 0.0765262104657861, test binary logloss: 0.08585479286835201
[With weight] train binary logloss: 0.20111509558258803, test binary logloss: 0.23334695880902565

[LightGBM] [Info] Number of positive: 349, number of negative: 3151
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000562 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 5100
[LightGBM] [Info] Number of data points in the train set: 3500, number of used features: 20
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.499205 -> initscore=-0.003179
[LightGBM] [Info] Start training from score -0.003179
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[1] train's binary_logloss: 0.599823    test's binary_logloss: 0.602056
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2] train's binary_logloss: 0.5234  test's binary_logloss: 0.527533
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3] train's binary_logloss: 0.459693    test's binary_logloss: 0.465714
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4] train's binary_logloss: 0.405868    test's binary_logloss: 0.413655
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5] train's binary_logloss: 0.359891    test's binary_logloss: 0.36924
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[6] train's binary_logloss: 0.320299    test's binary_logloss: 0.331169
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[7] train's binary_logloss: 0.285977    test's binary_logloss: 0.298289
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[8] train's binary_logloss: 0.256051    test's binary_logloss: 0.269504
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[9] train's binary_logloss: 0.229836    test's binary_logloss: 0.244618
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[10]    train's binary_logloss: 0.206785    test's binary_logloss: 0.222888
[Without weight] train binary logloss: 0.20814183616491222, test binary logloss: 0.21482494425887527
[With weight] train binary logloss: 0.20678476665278256, test binary logloss: 0.22288754310582817

Reproducible example(s)

import lightgbm
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def log_loss(y_true, y_pred, pos_weight=None):
    is_positive = y_true == 1
    loss = np.zeros_like(y_pred)
    loss[is_positive] = -np.log(y_pred[is_positive])
    loss[~is_positive] = -np.log(1.0 - y_pred[~is_positive])
    if pos_weight:
        weights = np.ones_like(y_pred)
        weights[is_positive] = pos_weight
        return np.average(loss, weights=weights)
    else:
        return np.average(loss)

def train_lightgbm(train_X, test_X, train_y, test_y, pos_weight, set_unbalance=True):
    params = {
        "num_iterations": 10,
        "objective": "binary",
        "metrics": ["binary_logloss"],
        "seed": 0
    }
    if set_unbalance:
        params["is_unbalance"] = True
        train_dataset = lightgbm.Dataset(train_X, train_y)
        test_dataset = lightgbm.Dataset(test_X, test_y, reference=train_dataset)
    else:
        train_weights = np.ones_like(train_y)
        train_weights[train_y == 1] = pos_weight
        test_weights = np.ones_like(test_y)
        test_weights[test_y == 1] = pos_weight
        train_dataset = lightgbm.Dataset(train_X, train_y, weight=train_weights)
        test_dataset = lightgbm.Dataset(test_X, test_y, weight=test_weights, reference=train_dataset)

    model = lightgbm.train(params,
                           train_dataset,
                           valid_sets=[train_dataset, test_dataset],
                           valid_names=["train", "test"])
    train_preds = model.predict(train_X)
    test_preds = model.predict(test_X)
    print("[Without weight] train binary logloss: {}, test binary logloss: {}".format(log_loss(train_y, train_preds),
                                                                     log_loss(test_y, test_preds)))
    print("[With weight] train binary logloss: {}, test binary logloss: {}".format(log_loss(train_y, train_preds, pos_weight=pos_weight),
                                                                     log_loss(test_y, test_preds, pos_weight=pos_weight)))

def main():
    X, y = make_classification(5000, weights=[0.9, 0.1], flip_y=0.0, random_state=0)
    train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

    n_train_pos = train_y.sum()
    n_test_pos = test_y.sum()
    print("Train data, positive: {}, negative: {}, positive ratio: {:.2f}".format(n_train_pos,
                                                                                  len(train_y) - n_train_pos,
                                                                                  n_train_pos / len(train_y)))
    print("Test data, positive: {}, negative: {}, positive ratio: {:.2f}".format(n_test_pos,
                                                                                  len(test_y) - n_test_pos,
                                                                                  n_test_pos / len(test_y)))
    train_lightgbm(train_X, test_X, train_y, test_y, pos_weight=9, set_unbalance=True)
    train_lightgbm(train_X, test_X, train_y, test_y, pos_weight=9, set_unbalance=False)

if __name__ == "__main__":
    main()

Description

The `is_unbalance` parameter does not properly assign `label_weight` for evaluation. In the example, the dataset is unbalanced, with 10% positive and 90% negative instances. First, I set `is_unbalance` to True and got a training binary log loss of 0.0765262 and a test binary log loss of 0.0858548. However, when I instead assigned sample weights directly to the datasets, I got a different training/test binary log loss of 0.206785 and 0.222888, respectively. I expected `is_unbalance` to apply sample weights to both the objective function and the evaluation metric, but it seems this parameter only affects the optimization objective.
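
As a sanity check, the class counts in the training log suggest what re-weighting `is_unbalance` would imply if it rescales the positive class by roughly the class ratio (that ratio is an assumption on my part, not documented behaviour), and the hand-rolled `log_loss` above can be cross-checked against `sklearn.metrics.log_loss` with `sample_weight`:

import numpy as np
from sklearn.metrics import log_loss as sk_log_loss

# Class counts taken from the training log above.
n_pos, n_neg = 349, 3151
# Assuming is_unbalance rescales the positive class by roughly n_neg / n_pos,
# the implied weight is ~9.03, close to the pos_weight=9 used in the second run.
implied_pos_weight = n_neg / n_pos

def weighted_binary_logloss(y_true, y_pred, pos_weight=1.0):
    # sklearn normalizes by the sum of sample weights, matching
    # np.average(loss, weights=weights) in the log_loss helper above.
    weights = np.where(y_true == 1, pos_weight, 1.0)
    return sk_log_loss(y_true, y_pred, sample_weight=weights)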

guolinke commented 3 years ago

`is_unbalance` and `scale_pos_weight` are designed for the objective function, so this behavior is expected. In most use cases, we don't want to change the distribution of the evaluation data. If you do want to change it, you can use sample weights directly, as in your script.
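
For example, if the goal is a re-weighted metric while still letting `is_unbalance` handle the objective, weights can be attached to the evaluation Dataset only. A minimal sketch along those lines (it reuses the synthetic data from the reproducible example above, and `pos_weight = 9` is simply the ratio chosen there, not something LightGBM computes):

import lightgbm
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Same synthetic data as the reproducible example above.
X, y = make_classification(5000, weights=[0.9, 0.1], flip_y=0.0, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size=0.3, random_state=0)

pos_weight = 9  # the ratio the reported metric should reflect

params = {
    "num_iterations": 10,
    "objective": "binary",
    "metric": ["binary_logloss"],
    "is_unbalance": True,  # re-weights the training objective only
    "seed": 0,
}

# No weights on the training Dataset: the imbalance is handled by is_unbalance.
train_dataset = lightgbm.Dataset(train_X, train_y)
# Weights on the evaluation Dataset, so the reported binary_logloss is weighted.
test_weights = np.where(test_y == 1, pos_weight, 1.0)
test_dataset = lightgbm.Dataset(test_X, test_y, weight=test_weights, reference=train_dataset)

model = lightgbm.train(params, train_dataset,
                       valid_sets=[test_dataset], valid_names=["test"])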

kevinorjohn commented 3 years ago

Ok, I got it. Let me close this issue.

github-actions[bot] commented 1 year ago

This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.