LightGBM hangs when loading a model file in a subprocess

assassin5615 commented 11 months ago

Description

Train two models in the main process and save them to two model files. Then use multiprocessing.Pool to load these two model files in subprocesses: the subprocesses hang. Part of the stack trace, captured with pyrasite-shell, is below.

File "simple_lgbm.py", line 77, in predict
  x = lgb.Booster(model_file=file_name)
File ".../lightgbm/basic.py", line 2087, in __init__
  _safe_call(_LIB.LGBM_BoosterCreateFromModelfile(

gdb shows more detail: the CreateBoosting function calls something like __kmp_api_GOMP_parallel_40_alias(), and it hangs at __kmp_suspend_64().

The LightGBM FAQ mentions that, due to an OpenMP bug, LightGBM can hang when multithreading is combined with fork on Linux, and it suggests setting nthreads=1 to disable multithreading. But setting nthreads=1 has no effect for lgb.Booster when loading a model file.

Is there a workaround or fix for this?
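One direction that is often tried for fork-related OpenMP hangs is to start the workers with the "spawn" start method instead of "fork", so that each child is a fresh interpreter and does not inherit OpenMP runtime state from a parent that has already trained. The following is only a minimal sketch of that idea, not something confirmed in this thread to resolve the hang. The names START_METHOD, predict_from_file and predict_all are hypothetical helpers introduced for illustration.

```python
from multiprocessing import get_context

# Assumption: "spawn" gives each worker a fresh interpreter, so it does not
# inherit OpenMP runtime state from a parent process that already trained.
START_METHOD = "spawn"


def predict_from_file(args):
    """Load a saved model inside the worker and predict on the given rows."""
    model_file, rows = args
    # Imported here so the library is initialized in the worker itself.
    import lightgbm as lgb

    booster = lgb.Booster(model_file=model_file)
    return model_file, booster.predict(rows)


def predict_all(model_files, rows, processes=2):
    """Run predict_from_file for every model file in a spawn-based pool."""
    tasks = [(name, rows) for name in model_files]
    with get_context(START_METHOD).Pool(processes=processes) as pool:
        # Results arrive in completion order; key them by model file name.
        return dict(pool.imap_unordered(predict_from_file, tasks))
```

With "spawn", the main script is re-imported in every worker, so the training code and a call such as predict_all(['model1.txt', 'model2.txt'], X_test) would need to sit under an `if __name__ == "__main__":` guard. The test data is pickled and sent to each worker, which costs more than fork for large inputs.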

Reproducible example

The code is based on simple_example.py from the LightGBM repo.

# coding: utf-8
from pathlib import Path
from multiprocessing import get_context

import pandas as pd
from sklearn.metrics import mean_squared_error

import lightgbm as lgb

print('Loading data...')
# load or create your dataset
regression_example_dir = Path(__file__).absolute().parents[1] / 'regression'
df_train = pd.read_csv(str(regression_example_dir / 'regression.train'), header=None, sep='\t')
df_test = pd.read_csv(str(regression_example_dir / 'regression.test'), header=None, sep='\t')

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

# specify your configurations as a dict
params = dict(
    task='train',
    objective='regression',
    num_leaves=50,
    max_depth=6,
    n_jobs=10,
    min_data_in_leaf=100,
    feature_fraction=0.8,
    num_iterations=20,
    learning_rate=0.1,
    deterministic=True,
    metric=['rmse'],
    force_col_wise=True,
    verbose=-1
    )

print('Starting training...')
def train(file_name: str):
    # train
    gbm = lgb.train(params,
                lgb_train,
                num_boost_round=20,
                valid_sets=lgb_eval,
                callbacks=[lgb.early_stopping(stopping_rounds=5)])

    print('Saving model...')
    # save model to file
    gbm.save_model(file_name)

train('model1.txt')
train('model2.txt')

print('Starting predicting...')

def predict(file_name: str):
    # it hangs here
    x = lgb.Booster(model_file=file_name)
    y_pred = x.predict(X_test, num_iteration=x.best_iteration)
    rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
    print(f'The RMSE of prediction is: {rmse_test}')
    # return the score so the pool loop below has a value to report
    return rmse_test

with get_context("fork").Pool(processes=2) as pool:
    for r in pool.imap_unordered(predict, ['model1.txt', 'model2.txt']):
        print(f'got result {r}')

Environment info

LightGBM version or commit hash: 4.0.0

Command(s) you used to install LightGBM

pip install lightgbm

Additional Comments

shiyu1994 commented 11 months ago

@assassin5615 Thanks for using LightGBM. Did you try setting the environment variable OMP_NUM_THREADS to 1?

assassin5615 commented 11 months ago

@shiyu1994 Yes. In my environment, OMP_NUM_THREADS is always 1, because I ran into other issues that required setting OMP_NUM_THREADS to 1.

assassin5615 commented 10 months ago

I also printed the value of OMP_NUM_THREADS in the script: it is 1 before both the training and the prediction calls.
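For anyone repeating this check, here is a minimal sketch that reads OMP_NUM_THREADS from inside the pool workers rather than only in the parent. The helper names are hypothetical. It reports only what the environment says, not how many threads the OpenMP runtime actually uses. As far as I know, n_jobs is an alias of num_threads in LightGBM, so the n_jobs=10 in the example's params may still ask for 10 threads during training in the parent, whatever the environment variable is set to.

```python
import os
from multiprocessing import get_context


def omp_threads_in_this_process(_=None):
    """Return (pid, OMP_NUM_THREADS) as seen by whichever process runs this."""
    return os.getpid(), os.environ.get("OMP_NUM_THREADS")


def omp_threads_in_workers(processes=2, start_method="fork"):
    """Collect the value reported by each task run in a worker pool."""
    with get_context(start_method).Pool(processes=processes) as pool:
        return pool.map(omp_threads_in_this_process, range(processes))
```

Calling omp_threads_in_this_process() in the parent and omp_threads_in_workers() just before the prediction pool is created shows whether the children see the same value as the parent.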
