microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Random seed with boosting_type='gbdt' has no effect #2835

Closed: rdbuf closed this issue 4 years ago

rdbuf commented 4 years ago

Environment info

Operating System:

Linux jupyter 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

CPU/GPU model:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel Xeon Processor (Cascadelake)
Stepping:            6
CPU MHz:             2095.072
BogoMIPS:            4190.14
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
L3 cache:            16384K
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed

C++/Python/R version: Python 3.7.3

LightGBM version or commit hash: 2.3.1

Error description

Setting a seed on LightGBM with boosting_type='gbdt' has no effect: the model's output is always the same.

For other boosting_type values, the seed works as expected.

Reproducible examples

import lightgbm
import numpy as np
import pandas as pd
import sklearn.model_selection

np.random.seed(30)
n_columns = 40
n_rows = 10000

X = pd.DataFrame(np.random.uniform(-100,100,size=(n_rows, n_columns)), columns=list(range(n_columns)))
y = pd.Series(np.random.randint(0, 2, size=(n_rows)))

X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y)

def run(seed, engine='gbdt'):
    params = dict(
        boosting_type=engine,

        # I've set all the parameters related to randomness I could find,
        # just in case, but it didn't help:
        seed=seed,
        random_seed=seed,
        random_state=seed,
        data_random_seed=seed,
        feature_fraction_seed=seed,
        objective_seed=seed,
        bagging_seed=seed,
        extra_seed=seed,
        drop_seed=seed
    )

    train_dataset = lightgbm.Dataset(X_train, y_train)
    model = lightgbm.train(params=params, train_set=train_dataset)

    y_hat = model.predict(X_test)

    return y_hat

# This is expected to print False:
print((run(seed=2) == run(seed=4)).all())

# As it does here, for example:
print((run(seed=2, engine='goss') == run(seed=4, engine='goss')).all())

Steps to reproduce

  1. Run the above script.
  2. Expected output would be: False, False.
  3. Actual output: True, False.
StrikerRUS commented 4 years ago

Hey @rdbuf!

Your training doesn't involve any random process. That's why results are identical.
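
With default parameters, plain gbdt grows every tree deterministically on the full data, so nothing ever consumes a seed. Randomness only enters through options such as these (a non-exhaustive sketch; parameter names come from the LightGBM docs, values are illustrative):

# options that make gbdt training stochastic
stochastic_params = dict(
    bagging_fraction=0.8,  # row subsampling, controlled by bagging_seed (needs bagging_freq > 0)
    bagging_freq=1,        # re-sample the rows every iteration
    feature_fraction=0.8,  # column subsampling per tree, controlled by feature_fraction_seed
    extra_trees=True,      # randomized split thresholds, controlled by extra_seed
)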

Also, below are some comments about the usage of seeds.

Firstly, you shouldn't use all aliases. Use only one of them. You can see a corresponding warning in the training logs:

[LightGBM] [Warning] seed is set with random_seed=4, random_state=4 will be ignored. Current value: seed=4
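
In other words, trimming the params above down to a single alias is enough (a minimal sketch):

params = dict(
    boosting_type=engine,
    seed=seed
)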

Secondly, you don't need to specify all the concrete seeds if you set the general seed (and vice versa). Refer to the parameter's description:

this seed is used to generate other seeds, e.g. data_random_seed, feature_fraction_seed, etc. https://lightgbm.readthedocs.io/en/latest/Parameters.html#seed
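
The same description also notes that seed has lower priority than the concrete seeds, so a concrete seed only needs to be set explicitly when one component should deviate from the rest (a sketch):

params = dict(
    seed=42,         # generates all the component seeds
    bagging_seed=7   # an explicit value overrides the one derived from seed
)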

Thirdly, it is better to pass data_random_seed to Dataset parameters.
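
For example (a sketch; lightgbm.Dataset accepts a params dict):

train_dataset = lightgbm.Dataset(
    X_train, y_train,
    params={'data_random_seed': seed}
)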

Lastly, introduce some randomness via your training params and you will see the expected output:

...
    params = dict(
        boosting_type=engine,

        # one seed alias is enough; LightGBM derives the rest from it
        seed=seed,

        # sampling 70% of the features per tree makes training stochastic
        feature_fraction=0.7
    )
...
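
With feature_fraction below 1.0, each tree is grown on a random subset of the columns, so the seed now matters. Re-running the checks from your script with this run() should print False for gbdt as well:

print((run(seed=2) == run(seed=4)).all())  # now expected to be False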
rdbuf commented 4 years ago

Thank you so much @StrikerRUS! It's a lot clearer now what's going on. I believe the issue can be closed.