microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[GPU] Kernel crashed when using GPU #6399

Closed JuseTiZ closed 7 months ago

JuseTiZ commented 7 months ago

Description

The Jupyter Notebook kernel crashes when running LightGBM with GPU support enabled on a small dataset (~5 MB). The issue arises on a remote Linux server, not on a local setup.

Reproducible example

The following is the relevant code (X is the feature DataFrame, y the target, and y_bin the labels used for stratification; all are defined earlier in the notebook):

import lightgbm as lgb
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_log_error

mskf = StratifiedKFold(n_splits=5, shuffle=True, random_state=114514)

def objective(trial):
    # Define hyperparameters to tune
    param = {
        "objective": "regression",
        'metric': 'rmse',
        'verbosity': 2,
        'gpu_device_id': 0,
        'gpu_platform_id': 0,
        'device': 'gpu',
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators',500, 2000),
        'num_leaves': trial.suggest_int('num_leaves', 10, 200),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.3), 
        'max_depth': trial.suggest_int('max_depth', 4, 20),
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 1),
        'subsample': trial.suggest_float('subsample', 0.4, 1)
    }
    scores = []

    for train_idx, valid_idx in mskf.split(X, y_bin):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

        model = lgb.train(
                param, lgb_train, valid_sets=lgb_eval, callbacks=[lgb.early_stopping(stopping_rounds=30)]
            )

        valid_preds = model.predict(X_valid)
        oof_score = np.sqrt(mean_squared_log_error(y_valid, valid_preds))
        scores.append(oof_score)

    return np.mean(scores)

Output:

[I 2024-04-01 23:27:39,813] A new study created in memory with name: lgbm_model_training
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2605
[LightGBM] [Info] Number of data points in the train set: 75833, number of used features: 15

The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

Jupyter notebook log does not have very valuable information:

23:27:39.800 [info] Handle Execution of Cells 10 for ~/data/notebookb8bf9b7a37.ipynb
23:27:39.815 [info] Kernel acknowledged execution of cell 10 @ 1711985259815
23:27:41.989 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 
23:27:41.989 [info] Dispose Kernel process 48669.
23:27:42.059 [info] End cell 10 execution after -1711985259.815s, completed @ undefined, started @ 1711985259815

The kernel crash happens specifically when the 'device': 'gpu' parameter is set in the LightGBM configuration. Disabling GPU support allows the code to run correctly.
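
To get more than Jupyter's generic crash banner, the same GPU training can be run as a plain Python script from a terminal so that any native or OpenCL error is printed directly. A minimal sketch with synthetic data (not the actual ~5MB dataset from this report):

# Run as `python repro.py`; any low-level error message will appear in the
# terminal instead of being hidden behind a silent kernel restart.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 15))
y = 2.0 * X[:, 0] + rng.normal(size=10_000)

params = {
    "objective": "regression",
    "metric": "rmse",
    "verbosity": 2,
    "device": "gpu",
    "gpu_platform_id": 0,
    "gpu_device_id": 0,
}

train_set = lgb.Dataset(X, y)
booster = lgb.train(params, train_set, num_boost_round=10)
print("trained", booster.num_trees(), "trees")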

Environment info

LightGBM version:

$ pip list | grep lightgbm
lightgbm          4.3.0.99

I followed the documentation to install LightGBM with GPU Support:

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j4
cd ../
sh ./build-python.sh install --precompile

The issue seems related specifically to GPU utilization. Attempts to adjust gpu_device_id and gpu_platform_id settings did not resolve the problem. Is there a recommended approach to debug or fix this, or might there have been a misstep in the GPU installation or compilation process?
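
One additional check worth doing (suggested here as a sketch, not something already tried above) is to list the OpenCL platforms and devices that the driver actually exposes, and confirm that gpu_platform_id / gpu_device_id point at the NVIDIA card. The pyopencl package is not a LightGBM dependency; clinfo on the command line shows the same information.

# Enumerate OpenCL platforms and devices; the indices printed here are the
# values LightGBM expects for gpu_platform_id and gpu_device_id.
import pyopencl as cl

for p_id, platform in enumerate(cl.get_platforms()):
    print(f"platform {p_id}: {platform.name} ({platform.version})")
    for d_id, device in enumerate(platform.get_devices()):
        print(f"  device {d_id}: {device.name}")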

jameslamb commented 7 months ago

Thanks for using LightGBM and for the detailed report. Sorry you're running into this.

Could you please provide a few more details that'd help us to investigate this?

  • type of GPU
  • specific operating system
  • build logs from running cmake ... and make ...

It'd also help if you could make this example more minimal. For example:

  • does this happen with all combinations of the hyperparameters you're searching over, or only some subset? If some subset, could you provide just those subsets?
  • could you try other strategies to make this more minimal?
    • remove parameters one by one and see if you still get the error (e.g., remove reg_alpha and just accept LightGBM's defaults)
    • remove StratifiedKFold or other splitting and perform every training run on the same dataset
    • remove computation of evaluation scores (since this error is happening at training time)

Those sorts of things would help to narrow down the source of the problem.

JuseTiZ commented 7 months ago

@jameslamb Thanks for your quick reply. Here is the relevant information:

  • type of GPU

R: NVIDIA GeForce RTX 3090. It works fine when training DNNs or other ML models on the GPU (e.g., XGBoost).

  • specific operating system
$ uname -a
Linux master 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/*release
CentOS Linux release 7.5.1804 (Core) 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.5.1804 (Core) 
CentOS Linux release 7.5.1804 (Core) 
  • build logs from running cmake ... and make ...

I removed the build folder and ran the following commands:

$ rm -rf build
$ mkdir build
$ cd build
$ cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
-- The C compiler identification is GNU 13.2.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /public/home/zj/mambaforge/envs/ncsvp/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for CL_VERSION_3_0
-- Looking for CL_VERSION_3_0 - found
-- Found OpenCL: /usr/local/cuda/lib64/libOpenCL.so (found version "3.0") 
-- OpenCL include directory: /usr/local/cuda/include
-- Found Boost: /public/home/zj/mambaforge/envs/ncsvp/lib/cmake/Boost-1.78.0/BoostConfig.cmake (found suitable version "1.78.0", minimum required is "1.56.0") found components: filesystem system 
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done (3.3s)
-- Generating done (0.1s)
-- Build files have been written to: /public/home/zj/tools/LightGBM/build
$ make -j4
[  2%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt.cpp.o
[  5%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt_model_text.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/boosting.cpp.o
[ 10%] Building CXX object CMakeFiles/lightgbm_capi_objs.dir/src/c_api.cpp.o
[ 12%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt_prediction.cpp.o
[ 15%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/prediction_early_stop.cpp.o
[ 17%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/sample_strategy.cpp.o
[ 20%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/bin.cpp.o
[ 23%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/config.cpp.o
[ 25%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/config_auto.cpp.o
[ 28%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/dataset.cpp.o
[ 28%] Built target lightgbm_capi_objs
[ 30%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/dataset_loader.cpp.o
[ 33%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/file_io.cpp.o
[ 35%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/json11.cpp.o
[ 38%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/metadata.cpp.o
[ 41%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/parser.cpp.o
[ 43%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/train_share_states.cpp.o
[ 46%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/tree.cpp.o
[ 48%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/metric/dcg_calculator.cpp.o
[ 51%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/metric/metric.cpp.o
[ 53%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linker_topo.cpp.o
[ 56%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linkers_mpi.cpp.o
[ 58%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linkers_socket.cpp.o
[ 61%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/network.cpp.o
[ 64%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/objective/objective_function.cpp.o
[ 66%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/data_parallel_tree_learner.cpp.o
[ 69%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/feature_histogram.cpp.o
[ 71%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
[ 74%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/gpu_tree_learner.cpp.o
[ 76%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/gradient_discretizer.cpp.o
[ 79%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/linear_tree_learner.cpp.o
[ 82%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/serial_tree_learner.cpp.o
[ 84%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/tree_learner.cpp.o
[ 87%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/voting_parallel_tree_learner.cpp.o
[ 89%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/utils/openmp_wrapper.cpp.o
[ 89%] Built target lightgbm_objs
[ 92%] Linking CXX shared library /public/home/zj/tools/LightGBM/lib_lightgbm.so
[ 94%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
[ 97%] Building CXX object CMakeFiles/lightgbm.dir/src/main.cpp.o
[ 97%] Built target _lightgbm
[100%] Linking CXX executable /public/home/zj/tools/LightGBM/lightgbm
[100%] Built target lightgbm
$ cd ../
$ pip uninstall lightgbm
$ sh ./build-python.sh install --precompile
building lightgbm
Requirement already satisfied: build>=0.10.0 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (1.2.1)
Requirement already satisfied: packaging>=19.1 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (24.0)
Requirement already satisfied: pyproject_hooks in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (1.0.0)
Requirement already satisfied: tomli>=1.1.0 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (2.0.1)
found pre-compiled lib_lightgbm.so
--- building sdist ---
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools
* Getting build dependencies for sdist...
running egg_info
creating lightgbm.egg-info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.dll' under directory 'lightgbm'
adding license file 'LICENSE'
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.dll' under directory 'lightgbm'
adding license file 'LICENSE'
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
running check
creating lightgbm-4.3.0.99
creating lightgbm-4.3.0.99/lightgbm
creating lightgbm-4.3.0.99/lightgbm.egg-info
creating lightgbm-4.3.0.99/lightgbm/lib
copying files to lightgbm-4.3.0.99...
copying LICENSE -> lightgbm-4.3.0.99
copying MANIFEST.in -> lightgbm-4.3.0.99
copying README.rst -> lightgbm-4.3.0.99
copying pyproject.toml -> lightgbm-4.3.0.99
copying setup.cfg -> lightgbm-4.3.0.99
copying lightgbm/__init__.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/basic.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/callback.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/compat.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/dask.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/engine.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/libpath.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/plotting.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/py.typed -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/sklearn.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm.egg-info/PKG-INFO -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/SOURCES.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/dependency_links.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/requires.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/top_level.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm/lib/lib_lightgbm.so -> lightgbm-4.3.0.99/lightgbm/lib
copying lightgbm.egg-info/SOURCES.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
Writing lightgbm-4.3.0.99/setup.cfg
Creating tar archive
removing 'lightgbm-4.3.0.99' (and everything under it)
Successfully built lightgbm-4.3.0.99.tar.gz
--- installing lightgbm ---
Looking in links: .
Processing ./lightgbm-4.3.0.99.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy (from lightgbm)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 360.5 kB/s eta 0:00:00
Collecting scipy (from lightgbm)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.4/60.4 kB 1.6 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 1.7 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.4/38.4 MB 1.8 MB/s eta 0:00:00
Building wheels for collected packages: lightgbm
  Building wheel for lightgbm (pyproject.toml) ... done
  Created wheel for lightgbm: filename=lightgbm-4.3.0.99-py3-none-any.whl size=3277584 sha256=3752165c110132d19d5b44d124f9636055f64195ec12c3c91f1e600395bf68be
  Stored in directory: /tmp/pip-ephem-wheel-cache-py4flilz/wheels/ab/ca/5d/8c248e7743594e1bd99a125aa24e0b01596f879dd6c7241e66
Successfully built lightgbm
Installing collected packages: numpy, scipy, lightgbm
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.12.0
    Uninstalling scipy-1.12.0:
      Successfully uninstalled scipy-1.12.0
Successfully installed lightgbm-4.3.0.99 numpy-1.26.4 scipy-1.12.0
cleaning up
  • does this happen with all combinations of the hyperparameters you're searching over, or only some subset? If some subset, could you provide just those subsets?
  • could you try other strategies to make this more minimal?

    • remove parameters one by one and see if you still get the error (e.g., remove reg_alpha and just accept LightGBM's defaults)
    • remove StratifiedKFold or other splitting and perform every training run on the same dataset
    • remove computation of evaluation scores (since this error is happening at training time)

I've tried setting only the most basic parameters:

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "gpu",
    "boosting_type": "gbdt",
}

lgb_train = lgb.Dataset(X, y)
model = lgb.train(
        params, lgb_train, callbacks=[lgb.early_stopping(stopping_rounds=30)]
    )

Output:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15

The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

The same happens when using the scikit-learn API:

from lightgbm import LGBMRegressor

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "gpu",
    "boosting_type": "gbdt",
}

model = LGBMRegressor(**params)
model.fit(X, y)

Output:

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15

The Kernel crashed while executing code in the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

As before, replacing "device": "gpu" with "device": "cpu" makes it work properly.

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "cpu",
    "boosting_type": "gbdt",
}

model = LGBMRegressor(**params)
model.fit(X, y)

Output:

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000043
[LightGBM] [Debug] init for col-wise cost 0.000014 seconds, init for row-wise cost 0.025067 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.104751 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15
[LightGBM] [Info] Start training from score 9.707233
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
......
shiyu1994 commented 7 months ago

Thanks for reporting this. If you are using a single NVIDIA GPU for training, could you please try with our new CUDA version instead of the legacy GPU version (with -DUSE_CUDA=ON instead of -DUSE_GPU=ON)? It should be faster. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#id20
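
For example, reusing the build steps from your report but with the CUDA flag (the -DOpenCL_LIBRARY / -DOpenCL_INCLUDE_DIR options are only needed for the -DUSE_GPU build):

cd LightGBM
rm -rf build
mkdir build
cd build
cmake -DUSE_CUDA=1 ..
make -j4
cd ../
sh ./build-python.sh install --precompile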

JuseTiZ commented 7 months ago

@shiyu1994 CMake failed when using -DUSE_CUDA=ON instead of -DUSE_GPU=ON:

$ cmake -DUSE_CUDA=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ -DCMAKE_C_COMPILER=/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc ..
-- The C compiler identification is GNU 12.1.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc - broken
CMake Error at /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/share/cmake-3.28/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'

    Run Build Command(s): /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_5ca6c/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_5ca6c.dir/build.make CMakeFiles/cmTC_5ca6c.dir/build
    gmake[1]: Entering directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'
    Building C object CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o
    /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc   -march  -o CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o -c /public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p/testCCompiler.c
    x86_64-conda-linux-gnu-cc: error: unrecognized command-line option '-march'
    gmake[1]: *** [CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o] Error 1
    gmake[1]: Leaving directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'
    gmake: *** [cmTC_5ca6c/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (project)

-- Configuring incomplete, errors occurred!

I tried to downgrade gcc but this didn't help:

$ cmake -DUSE_CUDA=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ -DCMAKE_C_COMPILER=/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc ..
-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc - broken
CMake Error at /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/share/cmake-3.28/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'

    Run Build Command(s): /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_88666/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_88666.dir/build.make CMakeFiles/cmTC_88666.dir/build
    gmake[1]: Entering directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'
    Building C object CMakeFiles/cmTC_88666.dir/testCCompiler.c.o
    /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc   -march  -o CMakeFiles/cmTC_88666.dir/testCCompiler.c.o -c /public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5/testCCompiler.c
    x86_64-conda-linux-gnu-cc: error: unrecognized command line option '-march'; did you mean '-march='?
    gmake[1]: *** [CMakeFiles/cmTC_88666.dir/testCCompiler.c.o] Error 1
    gmake[1]: Leaving directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'
    gmake: *** [cmTC_88666/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (project)

-- Configuring incomplete, errors occurred!

Is it because my gcc version is still wrong, or should I modify some files?

JuseTiZ commented 7 months ago

I removed the stray -march flag from CMakeCache.txt and installed the CUDA version.

Replacing "device": "gpu" with "device": "cuda" makes lightgbm work well on GPU and was significantly accelerated.

Thanks for the advice.