microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
MIT License

60x Slowdown Using Concurrency #6728

Closed JohanLoknaQC closed 11 hours ago

JohanLoknaQC commented 4 days ago


There seems to be a clear issue related to how lightgbm handles resource sharing. When restricting the number of cores associated with a process, the runtime increases significantly.

In the example provided below, the runtime using all cores (0-15) is about 1.821 seconds. When restricting the process to all cores but one (0-14), the runtime increases to 109.31 seconds, more than a 60x increase. This only happens if the resource restriction is done from within the Python script. If the affinity is set beforehand using taskset -c 0-14, the runtime is approximately the same: 1.796 seconds.

This makes training multiple lightgbm models in parallel undesirable, at least if the subprocesses are called from within a Python script. As this is a common pattern for implementing concurrency, this appears to be a limitation that can hopefully be easily addressed.
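One way to inspect the situation described above is to compare the machine-wide CPU count with the set of CPUs the process is actually allowed to run on. The sketch below is a diagnostic only, assuming Linux (where `os.sched_getaffinity` and `os.sched_setaffinity` exist); the helper name `cpu_summary` is hypothetical and not part of LightGBM.

```python
import os


def cpu_summary() -> dict:
    """Return the machine-wide CPU count and the CPUs this process may use."""
    total = os.cpu_count()
    # sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        allowed = sorted(os.sched_getaffinity(0))
    else:
        allowed = list(range(total or 1))
    return {"total": total, "allowed": allowed}


if __name__ == "__main__":
    before = cpu_summary()
    if hasattr(os, "sched_setaffinity") and len(before["allowed"]) > 1:
        try:
            # Drop one CPU from the allowed set, as in the issue.
            os.sched_setaffinity(0, set(before["allowed"][:-1]))
        except OSError:
            pass  # some sandboxes do not permit changing affinity
    after = cpu_summary()
    print("os.cpu_count():", after["total"])
    print("allowed before:", len(before["allowed"]))
    print("allowed after: ", len(after["allowed"]))
```

`os.cpu_count()` reports the number of logical CPUs in the system and does not change when the affinity is restricted, so any code that sizes a thread pool from the machine-wide count can end up with more threads than allowed cores.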


Reproducible example

import argparse
import lightgbm as lgb
import numpy as np
import os


def main(use_setaffinity: bool = False, use_taskset: bool = False):

    n = os.cpu_count()

    # Set affinity using ``os.sched_setaffinity``
    if use_setaffinity:
        os.sched_setaffinity(0, set(range(n - 1)))
        os.environ['OMP_NUM_THREADS'] = str(n - 1)  # Added after suggestion

    # Set affinity using ``taskset``
    if use_taskset:
        pid = os.getpid()
        command = f"taskset -cp 0-{n - 2} {pid}"
        os.system(command)  # apply the new affinity to this already-running process
        os.environ['OMP_NUM_THREADS'] = str(n - 1)  # Added after suggestion

    # Generate a data set
    nrows, ncols = 1_000, 10
    X = np.random.normal(size=(nrows, ncols))
    y = X @ np.random.normal(size=ncols) + np.random.normal(size=nrows)

    lgb_train = lgb.Dataset(X, y)

    # Train model
    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 31,  # the default value
        "learning_rate": 0.05,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": 0,
    }
    lgb.train(params, lgb_train)

if __name__ == "__main__":

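    parser = argparse.ArgumentParser()
    # Flags matching the invocations shown below
    parser.add_argument("--use-setaffinity", action="store_true")
    parser.add_argument("--use-taskset", action="store_true")
    args = parser.parse_args()

    main(use_setaffinity=args.use_setaffinity, use_taskset=args.use_taskset)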

time python > /dev/null 2>&1
time python  --use-setaffinity > /dev/null 2>&1
time python  --use-taskset > /dev/null 2>&1
time taskset -c 0-14 python > /dev/null 2>&1


# Using all cores
real    0m1.821s
user    0m4.394s
sys     0m0.178s

# Using ``os.sched_setaffinity`` from within the process
real    1m49.313s
user    25m44.344s
sys     0m1.109s

# Using ``taskset`` from within the process
real    1m48.820s
user    25m54.104s
sys     0m0.959s

# Using ``taskset`` before initializing the process
real    0m1.796s
user    0m4.135s
sys     0m0.203s
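As a rough reading of these timings, the sketch below computes the wall-clock slowdown relative to the all-cores run and the ratio of user time to real time, which approximates how many cores were busy on average. The numbers are copied from the output above; the dictionary keys and the `to_seconds` helper are just illustrative names.

```python
def to_seconds(minutes: int, seconds: float) -> float:
    """Convert the 'XmY.YYYs' format printed by `time` into seconds."""
    return 60 * minutes + seconds


# Timings copied from the output above (real = wall clock, user = CPU time).
runs = {
    "all cores": {"real": to_seconds(0, 1.821), "user": to_seconds(0, 4.394)},
    "sched_setaffinity in process": {"real": to_seconds(1, 49.313), "user": to_seconds(25, 44.344)},
    "taskset in process": {"real": to_seconds(1, 48.820), "user": to_seconds(25, 54.104)},
    "taskset before start": {"real": to_seconds(0, 1.796), "user": to_seconds(0, 4.135)},
}

baseline = runs["all cores"]["real"]
for name, t in runs.items():
    slowdown = t["real"] / baseline      # wall-clock time relative to the baseline
    busy_cores = t["user"] / t["real"]   # average number of cores kept busy
    print(f"{name:30s} slowdown {slowdown:6.1f}x   user/real {busy_cores:5.1f}")
```

In the two slow runs the user/real ratio is about 14, meaning roughly 14 cores were busy for the whole wall-clock time. That is consistent with threads competing for CPU rather than sitting idle, although these numbers alone do not establish the cause.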

Environment info

LightGBM version or commit hash:

liblightgbm  4.5.0    cpu_h155599f_3  conda-forge
lightgbm     4.5.0    cpu_py_3        conda-forge

Command(s) you used to install LightGBM

micromamba install lightgbm

Other used packages:

numpy     1.26.4   py312heda63a1_0  conda-forge

The example was run on an AWS instance (ml.m5.4xlarge) with 16 cores.

Additional Comments

jmoralez commented 4 days ago

Hey @JohanLoknaQC, thanks for using LightGBM.

By default, LightGBM uses all available threads on the machine unless you tell it otherwise. So in your examples you're running n threads on only n - 1 cores, and they have to fight each other for CPU time. I think the easiest way to fix this is by doing something like os.environ['OMP_NUM_THREADS'] = str(n-1); that way you tell LightGBM to use the number of threads that you've limited the process to have.

JohanLoknaQC commented 1 day ago

Thanks a lot for the answer! However, after adding the suggested fix (see code above), the run times remain virtually unchanged. It does seem like something else might be causing the additional runtime.

jmoralez commented 1 day ago

Sorry, I think that only works if provided through the command line. Can you please set the num_threads argument instead? e.g.

    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 31,  # the default value
        "learning_rate": 0.05,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": 0,
        "num_threads": n - 1,  # <- set this
    }

JohanLoknaQC commented 11 hours ago

Thank you very much - this solved this issue.

Just for reference, it also worked when the affinities were set quite arbitrarily, e.g. 3-12. It therefore seems to be a quite general solution. 👍
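For anyone applying this fix when training several models in parallel, here is a minimal sketch of deriving `num_threads` from the current affinity and of splitting the allowed CPUs into disjoint groups, one per worker. It assumes Linux; `allowed_cpus`, `with_thread_limit` and `split_cpus` are hypothetical helper names, not LightGBM functions. Only the `num_threads` parameter itself comes from the thread above.

```python
import os


def allowed_cpus() -> list:
    """CPUs this process may run on (falls back to all CPUs off Linux)."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return list(range(os.cpu_count() or 1))


def with_thread_limit(params: dict) -> dict:
    """Return a copy of params with num_threads matched to the allowed CPUs."""
    limited = dict(params)
    limited["num_threads"] = len(allowed_cpus())
    return limited


def split_cpus(cpus: list, n_workers: int) -> list:
    """Split a CPU list into n_workers disjoint, near-equal chunks."""
    base, extra = divmod(len(cpus), n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more
        chunks.append(cpus[start:start + size])
        start += size
    return chunks


if __name__ == "__main__":
    print(with_thread_limit({"objective": "regression"}))
    print(split_cpus(list(range(16)), 3))
```

Each worker process could then call `os.sched_setaffinity(0, chunk)` with its own chunk and pass `with_thread_limit(params)` to `lgb.train`. With the affinity set to 3-12 as in the comment above, `with_thread_limit` would set `num_threads` to 10. Note that `split_cpus` returns empty chunks if there are more workers than CPUs, which a real setup would need to guard against.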