[GPU] Kernel crashed when using GPU

JuseTiZ commented 7 months ago

Description

Kernel crash occurs in Jupyter Notebook when running LightGBM with GPU support enabled on a small dataset (~5MB). This issue arises on a remote Linux server, not on a local setup.

Reproducible example

The relevant code is:

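import lightgbm as lgb
import numpy as np
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import StratifiedKFold

# X, y, and y_bin are defined elsewhere in the notebook (not shown here)
mskf = StratifiedKFold(n_splits=5, shuffle=True, random_state=114514)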

def objective(trial):
    # Define hyperparameters to tune
    param = {
        "objective": "regression",
        'metric': 'rmse',
        'verbosity': 2,
        'gpu_device_id': 0,
        'gpu_platform_id': 0,
        'device': 'gpu',
        'boosting_type': 'gbdt',
        'n_estimators': trial.suggest_int('n_estimators',500, 2000),
        'num_leaves': trial.suggest_int('num_leaves', 10, 200),
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.3), 
        'max_depth': trial.suggest_int('max_depth', 4, 20),
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1e-3, 10.0),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1e-3, 10.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 1),
        'subsample': trial.suggest_float('subsample', 0.4, 1)
    }
    scores = []

    for train_idx, valid_idx in mskf.split(X, y_bin):
        X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
        X_valid, y_valid = X.iloc[valid_idx], y.iloc[valid_idx]

        lgb_train = lgb.Dataset(X_train, y_train)
        lgb_eval = lgb.Dataset(X_valid, y_valid, reference=lgb_train)

        model = lgb.train(
                param, lgb_train, valid_sets=lgb_eval, callbacks=[lgb.early_stopping(stopping_rounds=30)]
            )

        valid_preds = model.predict(X_valid)
        oof_score = np.sqrt(mean_squared_log_error(y_valid, valid_preds))
        scores.append(oof_score)

    return np.mean(scores)

Output:

[I 2024-04-01 23:27:39,813] A new study created in memory with name: lgbm_model_training
[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2605
[LightGBM] [Info] Number of data points in the train set: 75833, number of used features: 15

The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

The Jupyter log does not contain much useful information:

23:27:39.800 [info] Handle Execution of Cells 10 for ~/data/notebookb8bf9b7a37.ipynb
23:27:39.815 [info] Kernel acknowledged execution of cell 10 @ 1711985259815
23:27:41.989 [error] Disposing session as kernel process died ExitCode: undefined, Reason: 
23:27:41.989 [info] Dispose Kernel process 48669.
23:27:42.059 [info] End cell 10 execution after -1711985259.815s, completed @ undefined, started @ 1711985259815

The kernel crash happens specifically when the 'device': 'gpu' parameter is set in the LightGBM configuration. Disabling GPU support allows the code to run correctly.
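
One way to get more information than the Jupyter log provides is to run the failing code in a separate Python process, so that a native crash kills only that child process while its exit status and stderr stay visible. The following is a minimal sketch under that assumption; `run_isolated` is a hypothetical helper name (not part of LightGBM or Jupyter), and the snippet passed to it here only simulates a crash with `os.abort()` rather than training a model.

```python
import subprocess
import sys


def run_isolated(code, timeout=600):
    """Run a Python snippet in a child process and report how it ended.

    Hypothetical helper for illustration; not part of LightGBM or Jupyter.
    """
    proc = subprocess.run(
        # -X faulthandler dumps a Python traceback to stderr on fatal signals
        [sys.executable, "-X", "faulthandler", "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return {
        # On POSIX, a negative return code -N means the child died from signal N
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # Stand-in for the failing training cell: os.abort() simulates a native crash
    result = run_isolated("import os; print('before crash', flush=True); os.abort()")
    print("return code:", result["returncode"])
    print(result["stderr"])
```

Replacing the stand-in snippet with the actual training code (including data loading) would show whether the process dies from a signal such as a segmentation fault, and anything the native library wrote to stderr before dying.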

Environment info

LightGBM version:

$ pip list | grep lightgbm
lightgbm          4.3.0.99

I followed the documentation to install LightGBM with GPU Support:

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
mkdir build
cd build
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
make -j4

cd ../
sh ./build-python.sh install --precompile

The issue seems related specifically to GPU utilization. Attempts to adjust gpu_device_id and gpu_platform_id settings did not resolve the problem. Is there a recommended approach to debug or fix this, or might there have been a misstep in the GPU installation or compilation process?

jameslamb commented 7 months ago

Thanks for using LightGBM and for the detailed report. Sorry you're running into this.

Could you please provide a few more details that'd help us to investigate this?

- type of GPU
- specific operating system
- build logs from running cmake ... and make ...

It'd also help if you could make this example more minimal. For example:

- does this happen with all combinations of the hyperparameters you're searching over, or only some subset? If some subset, could you provide just those subsets?
- could you try other strategies to make this more minimal?
  - remove parameters one by one and see if you still get the error (e.g., remove reg_alpha and just accept LightGBM's defaults)
  - remove StratifiedKFold or other splitting and perform every training run on the same dataset
  - remove computation of evaluation scores (since this error is happening at training time)

Those sorts of things would help to narrow down the source of the problem.
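
The "remove parameters one by one" strategy can be sketched roughly as follows. Both function names are hypothetical, and `train_ok` is an assumed callable that returns True when a training run with the given parameters completes. Because a kernel crash would take the whole process down, `train_ok` would in practice need to launch training in a separate process; the lambda in the usage example is only a stand-in.

```python
def drop_one_param(params, keep=("device",)):
    """Yield (removed_key, reduced_params), removing one parameter at a time.

    Keys listed in `keep` are never removed. Hypothetical helper.
    """
    for key in params:
        if key in keep:
            continue
        reduced = {k: v for k, v in params.items() if k != key}
        yield key, reduced


def suspects_by_removal(params, train_ok, keep=("device",)):
    """Return the parameters whose removal makes training succeed."""
    suspects = []
    for removed, reduced in drop_one_param(params, keep):
        if train_ok(reduced):
            suspects.append(removed)
    return suspects


if __name__ == "__main__":
    params = {"device": "gpu", "metric": "rmse", "reg_alpha": 0.1}
    # Stand-in check: pretend training only fails while reg_alpha is set
    print(suspects_by_removal(params, lambda p: "reg_alpha" not in p))
```

An empty result means no single parameter is responsible, which points at the parameters in `keep` or at the environment instead.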

JuseTiZ commented 7 months ago

@jameslamb Thanks for your quick reply. Here is the requested information:

type of GPU

R: NVIDIA GeForce RTX 3090. It works fine when training DNN or other ML models with GPU (like xgboost).

specific operating system

$ uname -a
Linux master 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ cat /etc/*release
CentOS Linux release 7.5.1804 (Core) 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

CentOS Linux release 7.5.1804 (Core) 
CentOS Linux release 7.5.1804 (Core)

build logs from running cmake ... and make ...

I removed the build folder and ran the following commands:

$ rm -rf build
$ mkdir build
$ cd build
$ cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ ..
-- The C compiler identification is GNU 13.2.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /public/home/zj/mambaforge/envs/ncsvp/bin/x86_64-conda-linux-gnu-cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5") 
-- Found OpenMP_CXX: -fopenmp (found version "3.1") 
-- Found OpenMP: TRUE (found version "4.5")  
-- Looking for CL_VERSION_3_0
-- Looking for CL_VERSION_3_0 - found
-- Found OpenCL: /usr/local/cuda/lib64/libOpenCL.so (found version "3.0") 
-- OpenCL include directory: /usr/local/cuda/include
-- Found Boost: /public/home/zj/mambaforge/envs/ncsvp/lib/cmake/Boost-1.78.0/BoostConfig.cmake (found suitable version "1.78.0", minimum required is "1.56.0") found components: filesystem system 
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done (3.3s)
-- Generating done (0.1s)
-- Build files have been written to: /public/home/zj/tools/LightGBM/build

$ make -j4
[  2%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt.cpp.o
[  5%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt_model_text.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/boosting.cpp.o
[ 10%] Building CXX object CMakeFiles/lightgbm_capi_objs.dir/src/c_api.cpp.o
[ 12%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/gbdt_prediction.cpp.o
[ 15%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/prediction_early_stop.cpp.o
[ 17%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/boosting/sample_strategy.cpp.o
[ 20%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/bin.cpp.o
[ 23%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/config.cpp.o
[ 25%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/config_auto.cpp.o
[ 28%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/dataset.cpp.o
[ 28%] Built target lightgbm_capi_objs
[ 30%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/dataset_loader.cpp.o
[ 33%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/file_io.cpp.o
[ 35%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/json11.cpp.o
[ 38%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/metadata.cpp.o
[ 41%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/parser.cpp.o
[ 43%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/train_share_states.cpp.o
[ 46%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/io/tree.cpp.o
[ 48%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/metric/dcg_calculator.cpp.o
[ 51%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/metric/metric.cpp.o
[ 53%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linker_topo.cpp.o
[ 56%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linkers_mpi.cpp.o
[ 58%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/linkers_socket.cpp.o
[ 61%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/network/network.cpp.o
[ 64%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/objective/objective_function.cpp.o
[ 66%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/data_parallel_tree_learner.cpp.o
[ 69%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/feature_histogram.cpp.o
[ 71%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
[ 74%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/gpu_tree_learner.cpp.o
[ 76%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/gradient_discretizer.cpp.o
[ 79%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/linear_tree_learner.cpp.o
[ 82%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/serial_tree_learner.cpp.o
[ 84%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/tree_learner.cpp.o
[ 87%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/treelearner/voting_parallel_tree_learner.cpp.o
[ 89%] Building CXX object CMakeFiles/lightgbm_objs.dir/src/utils/openmp_wrapper.cpp.o
[ 89%] Built target lightgbm_objs
[ 92%] Linking CXX shared library /public/home/zj/tools/LightGBM/lib_lightgbm.so
[ 94%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
[ 97%] Building CXX object CMakeFiles/lightgbm.dir/src/main.cpp.o
[ 97%] Built target _lightgbm
[100%] Linking CXX executable /public/home/zj/tools/LightGBM/lightgbm
[100%] Built target lightgbm

$ cd ../
$ pip uninstall lightgbm
$ sh ./build-python.sh install --precompile
building lightgbm
Requirement already satisfied: build>=0.10.0 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (1.2.1)
Requirement already satisfied: packaging>=19.1 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (24.0)
Requirement already satisfied: pyproject_hooks in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (1.0.0)
Requirement already satisfied: tomli>=1.1.0 in /public/home/zj/mambaforge/envs/kaggle/lib/python3.10/site-packages (from build>=0.10.0) (2.0.1)
found pre-compiled lib_lightgbm.so
--- building sdist ---
* Creating isolated environment: venv+pip...
* Installing packages in isolated environment:
  - setuptools
* Getting build dependencies for sdist...
running egg_info
creating lightgbm.egg-info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.dll' under directory 'lightgbm'
adding license file 'LICENSE'
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
* Building sdist...
running sdist
running egg_info
writing lightgbm.egg-info/PKG-INFO
writing dependency_links to lightgbm.egg-info/dependency_links.txt
writing requirements to lightgbm.egg-info/requires.txt
writing top-level names to lightgbm.egg-info/top_level.txt
reading manifest file 'lightgbm.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.dll' under directory 'lightgbm'
adding license file 'LICENSE'
writing manifest file 'lightgbm.egg-info/SOURCES.txt'
running check
creating lightgbm-4.3.0.99
creating lightgbm-4.3.0.99/lightgbm
creating lightgbm-4.3.0.99/lightgbm.egg-info
creating lightgbm-4.3.0.99/lightgbm/lib
copying files to lightgbm-4.3.0.99...
copying LICENSE -> lightgbm-4.3.0.99
copying MANIFEST.in -> lightgbm-4.3.0.99
copying README.rst -> lightgbm-4.3.0.99
copying pyproject.toml -> lightgbm-4.3.0.99
copying setup.cfg -> lightgbm-4.3.0.99
copying lightgbm/__init__.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/basic.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/callback.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/compat.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/dask.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/engine.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/libpath.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/plotting.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/py.typed -> lightgbm-4.3.0.99/lightgbm
copying lightgbm/sklearn.py -> lightgbm-4.3.0.99/lightgbm
copying lightgbm.egg-info/PKG-INFO -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/SOURCES.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/dependency_links.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/requires.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm.egg-info/top_level.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
copying lightgbm/lib/lib_lightgbm.so -> lightgbm-4.3.0.99/lightgbm/lib
copying lightgbm.egg-info/SOURCES.txt -> lightgbm-4.3.0.99/lightgbm.egg-info
Writing lightgbm-4.3.0.99/setup.cfg
Creating tar archive
removing 'lightgbm-4.3.0.99' (and everything under it)
Successfully built lightgbm-4.3.0.99.tar.gz
--- installing lightgbm ---
Looking in links: .
Processing ./lightgbm-4.3.0.99.tar.gz
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Collecting numpy (from lightgbm)
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 360.5 kB/s eta 0:00:00
Collecting scipy (from lightgbm)
  Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.4/60.4 kB 1.6 MB/s eta 0:00:00
Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 1.7 MB/s eta 0:00:00
Downloading scipy-1.12.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.4/38.4 MB 1.8 MB/s eta 0:00:00
Building wheels for collected packages: lightgbm
  Building wheel for lightgbm (pyproject.toml) ... done
  Created wheel for lightgbm: filename=lightgbm-4.3.0.99-py3-none-any.whl size=3277584 sha256=3752165c110132d19d5b44d124f9636055f64195ec12c3c91f1e600395bf68be
  Stored in directory: /tmp/pip-ephem-wheel-cache-py4flilz/wheels/ab/ca/5d/8c248e7743594e1bd99a125aa24e0b01596f879dd6c7241e66
Successfully built lightgbm
Installing collected packages: numpy, scipy, lightgbm
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
  Attempting uninstall: scipy
    Found existing installation: scipy 1.12.0
    Uninstalling scipy-1.12.0:
      Successfully uninstalled scipy-1.12.0
Successfully installed lightgbm-4.3.0.99 numpy-1.26.4 scipy-1.12.0
cleaning up

does this happen with all combinations of the hyperparameters you're searching over, or only some subset? If some subset, could you provide just those subsets?

could you try other strategies to make this more minimal?

remove parameters one by one and see if you still get the error (e.g., remove reg_alpha and just accept LightGBM's defaults)

remove StratifiedKFold or other splitting and perform every training run on the same dataset

remove computation of evaluation scores (since this error is happening at training time)

I've tried setting only the most basic parameters:

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "gpu",
    "boosting_type": "gbdt",
}

lgb_train = lgb.Dataset(X, y)
model = lgb.train(
        params, lgb_train, callbacks=[lgb.early_stopping(stopping_rounds=30)]
    )

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15

The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

The same happens when using the scikit-learn API:

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "gpu",
    "boosting_type": "gbdt",
}

model = LGBMRegressor(**params)
model.fit(X, y)

[LightGBM] [Info] This is the GPU trainer!!
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15

The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click [here](https://aka.ms/vscodeJupyterKernelCrash) for more info. View Jupyter [log](command:jupyter.viewOutput) for further details.

As before, replacing "device": "gpu" with "device": "cpu" makes it work properly.

params = {
    "metric": "rmse",
    "verbosity": 2,
    "device": "cpu",
    "boosting_type": "gbdt",
}

model = LGBMRegressor(**params)
model.fit(X, y)

[LightGBM] [Debug] Dataset::GetMultiBinFromAllFeatures: sparse rate 0.000043
[LightGBM] [Debug] init for col-wise cost 0.000014 seconds, init for row-wise cost 0.025067 seconds
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.104751 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2612
[LightGBM] [Info] Number of data points in the train set: 94792, number of used features: 15
[LightGBM] [Info] Start training from score 9.707233
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 6
[LightGBM] [Debug] Trained a tree with leaves = 31 and depth = 7
......
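
The comparisons above differ only in the "device" entry. A small helper like the one below (a hypothetical name, shown only as a sketch) keeps every other parameter identical across runs, so any difference in behavior can be attributed to the device setting alone.

```python
def variants_by_device(params, devices=("cpu", "gpu")):
    """Return {device: params_copy} where only the "device" entry differs.

    Hypothetical helper; the input dict is left unmodified.
    """
    return {dev: {**params, "device": dev} for dev in devices}


if __name__ == "__main__":
    base = {"metric": "rmse", "verbosity": 2, "boosting_type": "gbdt"}
    for dev, p in variants_by_device(base).items():
        # Each p could be passed to LGBMRegressor(**p) in its own process
        print(dev, p)
```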

shiyu1994 commented 7 months ago

Thanks for reporting this. If you are using a single NVIDIA GPU for training, could you please try with our new CUDA version instead of the legacy GPU version (with -DUSE_CUDA=ON instead of -DUSE_GPU=ON)? It should be faster. https://lightgbm.readthedocs.io/en/latest/Installation-Guide.html#id20

JuseTiZ commented 7 months ago

@shiyu1994 CMake failed when using -DUSE_CUDA=ON instead of -DUSE_GPU=ON:

$ cmake -DUSE_CUDA=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ -DCMAKE_C_COMPILER=/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc ..
-- The C compiler identification is GNU 12.1.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc - broken
CMake Error at /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/share/cmake-3.28/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'

    Run Build Command(s): /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_5ca6c/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_5ca6c.dir/build.make CMakeFiles/cmTC_5ca6c.dir/build
    gmake[1]: Entering directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'
    Building C object CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o
    /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc   -march  -o CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o -c /public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p/testCCompiler.c
    x86_64-conda-linux-gnu-cc: error: unrecognized command-line option '-march'
    gmake[1]: *** [CMakeFiles/cmTC_5ca6c.dir/testCCompiler.c.o] Error 1
    gmake[1]: Leaving directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-ukHP6p'
    gmake: *** [cmTC_5ca6c/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (project)

-- Configuring incomplete, errors occurred!

I tried downgrading gcc, but this didn't help:

$ cmake -DUSE_CUDA=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ -DCMAKE_C_COMPILER=/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc ..
-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 4.8.5
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - failed
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc
-- Check for working C compiler: /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc - broken
CMake Error at /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/share/cmake-3.28/Modules/CMakeTestCCompiler.cmake:67 (message):
  The C compiler

    "/public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc"

  is not able to compile a simple test program.

  It fails with the following output:

    Change Dir: '/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'

    Run Build Command(s): /public/home/zj/tools/cmake-3.28.0-rc5-linux-x86_64/bin/cmake -E env VERBOSE=1 /usr/bin/gmake -f Makefile cmTC_88666/fast
    /usr/bin/gmake  -f CMakeFiles/cmTC_88666.dir/build.make CMakeFiles/cmTC_88666.dir/build
    gmake[1]: Entering directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'
    Building C object CMakeFiles/cmTC_88666.dir/testCCompiler.c.o
    /public/home/zj/mambaforge/envs/kaggle/bin/x86_64-conda-linux-gnu-cc   -march  -o CMakeFiles/cmTC_88666.dir/testCCompiler.c.o -c /public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5/testCCompiler.c
    x86_64-conda-linux-gnu-cc: error: unrecognized command line option '-march'; did you mean '-march='?
    gmake[1]: *** [CMakeFiles/cmTC_88666.dir/testCCompiler.c.o] Error 1
    gmake[1]: Leaving directory `/public/home/zj/tools/LightGBM/build/CMakeFiles/CMakeScratch/TryCompile-w5DwZ5'
    gmake: *** [cmTC_88666/fast] Error 2

  CMake will not be able to correctly generate this project.
Call Stack (most recent call first):
  CMakeLists.txt:32 (project)

-- Configuring incomplete, errors occurred!

Is it because my gcc version is still wrong, or should I modify some files?

JuseTiZ commented 7 months ago

I removed the -march entry from CMakeCache.txt and installed the CUDA version.
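
For anyone hitting the same error, a script along these lines could locate the offending entry before editing the file by hand. It is only a sketch: `find_bare_march` is a hypothetical helper, and the sample text is made up for illustration (the thread does not show which cache entry actually contained the flag). It looks for `-march` appearing as a standalone token with no `=<value>` after it, which is the form the compiler rejected in the logs above.

```python
import re

# Matches "-march" as a whole token that is NOT followed by "=<value>"
BARE_MARCH = re.compile(r"(?<!\S)-march(?!\S)")


def find_bare_march(cache_text):
    """Return (line_number, line) pairs for cache entries with a bare -march."""
    hits = []
    for number, line in enumerate(cache_text.splitlines(), start=1):
        stripped = line.strip()
        # Skip blank lines and CMakeCache.txt comment lines
        if not stripped or stripped.startswith(("//", "#")):
            continue
        # Cache entries have the form KEY:TYPE=VALUE
        _key, sep, value = stripped.partition("=")
        if sep and BARE_MARCH.search(value):
            hits.append((number, stripped))
    return hits


if __name__ == "__main__":
    # Made-up sample, not the actual cache from this report
    sample = (
        "# This is the CMakeCache file.\n"
        "//Flags used by the C compiler during all build types.\n"
        "CMAKE_C_FLAGS:STRING=-march\n"
        "CMAKE_CXX_FLAGS:STRING=-march=native -O2\n"
    )
    for number, line in find_bare_march(sample):
        print(number, line)
```

To check a real build directory, the file contents can be read with `open("build/CMakeCache.txt").read()` and passed to the function.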

Replacing "device": "gpu" with "device": "cuda" makes LightGBM work well on the GPU, and training is significantly faster.

Thanks for the advice.
