microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

60x Slowdown Using Concurrency #6728

Closed JohanLoknaQC closed 11 hours ago

JohanLoknaQC commented 4 days ago

Description

There seems to be a clear issue with how LightGBM handles resource sharing: when the number of cores available to a process is restricted, the runtime increases significantly.

In the example provided below, the runtime using all cores (0-15) is about 1.821 seconds. When the process is restricted to all cores but one (0-14), the runtime increases to 109.31 seconds, more than a 60x increase. This only happens if the restriction is applied from within the Python script; if the affinity is set beforehand using taskset -c 0-14, the runtime is approximately the same, 1.796 seconds.

This makes training multiple LightGBM models in parallel undesirable, at least when the subprocesses are launched from within a Python script (a sketch of the pattern is shown below). As this is a common way of implementing concurrency, this appears to be a limitation which can hopefully be addressed easily.
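
For context, a minimal sketch of the pattern meant here (hypothetical, not the actual workload): each worker process restricts its own CPU affinity and then trains its own model.

# Hypothetical illustration of the concurrency pattern described above:
# each worker pins itself to a subset of cores before training a model.
import os
from multiprocessing import Process

import lightgbm as lgb
import numpy as np


def train_one(cores: set) -> None:
    os.sched_setaffinity(0, cores)  # pin this worker to its subset of cores
    X = np.random.normal(size=(1_000, 10))
    y = X.sum(axis=1) + np.random.normal(size=1_000)
    lgb.train({"objective": "regression", "verbose": -1}, lgb.Dataset(X, y))


if __name__ == "__main__":
    n = os.cpu_count()
    # split the cores between two workers
    workers = [
        Process(target=train_one, args=(set(range(n // 2)),)),
        Process(target=train_one, args=(set(range(n // 2, n)),)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()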

Thanks!

Reproducible example

lgbm_affinity.py

import argparse
import lightgbm as lgb
import numpy as np
import os

np.random.seed(42)

def main(use_setaffinity: bool = False, use_taskset: bool = False):

    n = os.cpu_count()

    # Set affinity using ``os.sched_setaffinity``
    if use_setaffinity:
        os.sched_setaffinity(0, set(range(n - 1)))
        os.environ['OMP_NUM_THREADS'] = str(n - 1)  # Added after suggestion

    # Set affinity using ``taskset``
    if use_taskset:
        pid = os.getpid()
        command = f"taskset -cp 0-{n - 2} {pid}"
        os.system(command)
        os.environ['OMP_NUM_THREADS'] = str(n - 1)  # Added after suggestion

    # Generate a data set
    nrows, ncols = 1_000, 10
    X = np.random.normal(size=(nrows, ncols))
    y = X @ np.random.normal(size=ncols) + np.random.normal(size=nrows)

    lgb_train = lgb.Dataset(X, y)

    # Train model
    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 31,  # the default value
        "learning_rate": 0.05,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": 0
    }
    lgb.train(params, lgb_train)

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--use-setaffinity", 
        dest="use_setaffinity", 
        action="store_true",
    )

    parser.add_argument(
        "--use-taskset", 
        dest="use_taskset", 
        action="store_true",
    )

    args = parser.parse_args()
    main(**vars(args))

lgbm_affinity.sh

time python lgbm_affinity.py > /dev/null 2>&1
time python lgbm_affinity.py  --use-setaffinity > /dev/null 2>&1
time python lgbm_affinity.py  --use-taskset > /dev/null 2>&1
time taskset -c 0-14 python lgbm_affinity.py > /dev/null 2>&1

Output

# Using all cores
real    0m1.821s
user    0m4.394s
sys     0m0.178s

# Using ``sched_setaffinity`` from within the process
real    1m49.313s
user    25m44.344s
sys     0m1.109s

# Using ``taskset`` from within the process
real    1m48.820s
user    25m54.104s
sys     0m0.959s

# Using ``taskset`` before initializing the process
real    0m1.796s
user    0m4.135s
sys     0m0.203s

Environment info

LightGBM version or commit hash:

liblightgbm  4.5.0    cpu_h155599f_3  conda-forge
lightgbm     4.5.0    cpu_py_3        conda-forge

Command(s) you used to install LightGBM

micromamba install lightgbm

Other used packages:

numpy     1.26.4   py312heda63a1_0  conda-forge

The example was run on an AWS instance (ml.m5.4xlarge) with 16 vCPUs.

Additional Comments

jmoralez commented 4 days ago

Hey @JohanLoknaQC, thanks for using LightGBM.

By default LightGBM uses all available threads on the machine unless you tell it otherwise. So in your examples it spawns n threads while the process is only allowed to run on n - 1 cores, and the threads have to fight each other for CPU time. I think the easiest way to fix this is by doing something like os.environ['OMP_NUM_THREADS'] = str(n - 1), so that LightGBM only uses the number of threads you've limited the process to.
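
One way to apply this suggestion is to set the variable in the environment before the training process starts, since the OpenMP runtime typically reads OMP_NUM_THREADS only once, when it initializes. A minimal sketch, assuming the reproducer script above (the launcher itself is hypothetical):

import os
import subprocess

n = os.cpu_count()
# Make OMP_NUM_THREADS visible to the child process before OpenMP starts up
env = dict(os.environ, OMP_NUM_THREADS=str(n - 1))
subprocess.run(["python", "lgbm_affinity.py"], env=env, check=True)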

JohanLoknaQC commented 1 day ago

Thanks a lot for the answer! However, after adding the suggested fix (see the code above), the run-times remain virtually unchanged. It does seem like something else might be causing this additional run-time.

jmoralez commented 1 day ago

Sorry, I think that only works if it's provided through the command line. Can you please set the num_threads parameter instead? e.g.

    params = {
        "objective": "regression",
        "metric": "rmse",
        "num_leaves": 31,  # the default value
        "learning_rate": 0.05,
        "feature_fraction": 0.9,
        "bagging_fraction": 0.8,
        "bagging_freq": 5,
        "verbose": 0,
        "num_threads": n - 1,  # <- set this
    }
JohanLoknaQC commented 11 hours ago

Thank you very much - this solved the issue.

Just for reference, it also worked when the affinities were set quite arbitrarily, e.g. to cores 3-12. It therefore seems to be a quite general solution. 👍
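
A minimal sketch of that general form: derive num_threads from whatever affinity the process actually has. The use of os.sched_getaffinity here is an assumption (Linux-only), not taken from the thread.

import os

import lightgbm as lgb
import numpy as np

X = np.random.normal(size=(1_000, 10))
y = X.sum(axis=1)

params = {
    "objective": "regression",
    "num_threads": len(os.sched_getaffinity(0)),  # match the cores this process may use
    "verbose": 0,
}
lgb.train(params, lgb.Dataset(X, y))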