Closed JohanLoknaQC closed 11 hours ago
Hey @JohanLoknaQC, thanks for using LightGBM.
By default LightGBM uses all available threads on the machine unless you tell it otherwise. So in your examples you're submitting n tasks and assigning only n - 1 threads, so they have to fight each other to execute them. I think the easiest way to fix this is by doing something like os.environ['OMP_NUM_THREADS'] = str(n-1). That way you tell LightGBM to use the number of threads that you've limited the process to have.
Thanks a lot for the answer! However, after adding the suggested fix (see code above), the run times remain virtually unchanged. It does seem like something else might be causing this additional runtime.
Sorry, I think that only works if provided through the command line. Can you please set the num_threads argument instead? e.g.
params = {
    "objective": "regression",
    "metric": "rmse",
    "num_leaves": 31,  # the default value
    "learning_rate": 0.05,
    "feature_fraction": 0.9,
    "bagging_fraction": 0.8,
    "bagging_freq": 5,
    "verbose": 0,
    "num_threads": n - 1,  # <- set this
}
Thank you very much - this solved this issue.
Just for reference, it also worked when the affinities were set quite arbitrarily, e.g. 3-12. It therefore seems to be a quite general solution. 👍
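For readers landing here later, the fix above can be generalized so that num_threads always follows the affinity actually in effect, instead of a hard-coded n - 1. The following is only a minimal sketch: the helper names allowed_cores and with_matching_threads are hypothetical (they are not part of LightGBM), and os.sched_getaffinity is only available on some platforms such as Linux, so the sketch falls back to os.cpu_count() elsewhere.

```python
import os


def allowed_cores():
    """Return the set of CPU ids the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux and some other Unix systems: reflects taskset / sched_setaffinity.
        return set(os.sched_getaffinity(0))
    # Fallback (assumption): no affinity API available, so assume all CPUs are usable.
    return set(range(os.cpu_count() or 1))


def with_matching_threads(params, cores=None):
    """Return a copy of params whose num_threads equals the number of usable cores.

    cores may be passed explicitly; otherwise the current affinity is used.
    """
    cores = allowed_cores() if cores is None else set(cores)
    out = dict(params)  # do not mutate the caller's dict
    out["num_threads"] = max(1, len(cores))
    return out


# Hypothetical usage inside a training script (not executed here):
#   os.sched_setaffinity(0, range(3, 13))    # restrict this process to cores 3-12
#   params = with_matching_threads(params)    # num_threads becomes 10
#   booster = lgb.train(params, train_set)    # with lightgbm imported as lgb
```

Setting the affinity first and deriving num_threads afterwards keeps the two values from drifting apart when the core range changes.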
Description
There seems to be a clear issue related to how lightgbm handles resource sharing. When restricting the number of cores associated with a process, the runtime increases significantly.

In the example provided below, the runtime using all cores (0-15) is about 1.821 seconds. When restricting the process to all cores but one (0-14), the runtime increases to 109.31 seconds, more than a 60x increase. This only happens if the resource restriction is done from within the Python script. If the affinity is set beforehand using taskset -c 0-14, the runtime is approximately the same, 1.796 seconds.

This makes training multiple lightgbm models in parallel undesirable, at least if the subprocesses are called from within a Python script. As this is a common pattern for implementing concurrency, this appears to be a limitation which can hopefully be easily addressed and fixed.

Thanks!
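To make the parallel-training pattern described here concrete, the sketch below shows one way to divide a machine's cores into disjoint groups, one per worker process. It is not the reproduction script from this issue; split_cores is a hypothetical helper, and the idea that each worker then pins itself with os.sched_setaffinity and sets num_threads to its group size is an assumption based on the resolution in the comments.

```python
def split_cores(cores, n_workers):
    """Split core ids into n_workers contiguous, disjoint groups of near-equal size."""
    cores = sorted(cores)
    if n_workers < 1 or n_workers > len(cores):
        raise ValueError("n_workers must be between 1 and the number of cores")
    base, extra = divmod(len(cores), n_workers)
    groups = []
    start = 0
    for i in range(n_workers):
        # The first `extra` groups receive one additional core.
        size = base + (1 if i < extra else 0)
        groups.append(cores[start:start + size])
        start += size
    return groups


# Hypothetical usage inside each worker process (not executed here):
#   group = split_cores(range(16), 4)[worker_index]
#   os.sched_setaffinity(0, group)        # Linux-only affinity call
#   params["num_threads"] = len(group)    # match threads to pinned cores
```

On the 16-core instance mentioned in this issue, split_cores(range(16), 4) yields four groups of four cores each, so every worker would train with num_threads set to 4.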
Reproducible example
lgbm_affinity.py
lgbm_affinity.sh
Output
Environment info
LightGBM version or commit hash:
Command(s) you used to install LightGBM
Other used packages:
The example was run on an AWS instance (ml.m5.4xlarge) with 16 cores.
Additional Comments