microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

Why CUDA Performance Falls Short of CPU in LightGBM: Training and Inference Analysis #6697

Open unbelievable3513 opened 1 week ago

unbelievable3513 commented 1 week ago

Description

While reading the LightGBM source code and testing it with a binary classifier, I observed that training on the GPU is noticeably slower than on the CPU, roughly one-tenth of the CPU's speed in my test. In addition, for inference there appears to be no way to use any backend other than the CPU. Here "GPU" refers to device=cuda, which uses CUDATree, and "CPU" refers to device=cpu, which uses Tree or its derivatives. This raises the following questions:

Has anyone else observed this? If so, why is the CUDA backend (CUDATree) not used during inference? Is it because the operations involved are better suited to the CPU?

In which inference scenarios would the CUDA backend clearly outperform the CPU backend?

Is there any available documentation that offers a comprehensive explanation of CUDA acceleration for LightGBM?

infer

[screenshots]

train

onnx_runtime_c++ (infer)

[screenshots]

Reproducible example

import lightgbm as lgb
import numpy as np
import pandas as pd
import time
from sklearn.model_selection import train_test_split
from skl2onnx.common.data_types import FloatTensorType
import onnxmltools
import onnx
import onnxruntime as ort

X = np.random.rand(100, 5).astype(np.float32)
y = np.random.randint(0, 2, size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

train_data = lgb.Dataset(X_train, label=y_train)

device = "cuda"

params = {
    'objective': 'binary',
    'metric': 'binary_logloss',
    'boosting_type': 'gbdt',
    'verbose': -1,
    'device' : device
}

model = lgb.train(params, train_data, num_boost_round=100)

x = (np.arange(15).reshape((-1, 5)) - 5).astype(np.float32) / 5

# measure time in milliseconds
start_time = time.time()
preds = model.predict(x)
end_time = time.time()
print(preds)
print(f"lgb_{device} taken: {(end_time - start_time) * 1000:.3f} milliseconds")

initial_type = [('float_input', FloatTensorType([None, 5]))]

onnx_model = onnxmltools.convert_lightgbm(model, initial_types=initial_type, target_opset=12)

onnxmltools.utils.save_model(onnx_model, 'lightgbm_model.onnx')
print("ONNX model saved to lightgbm_model.onnx")

# run the model
model = onnx.load("lightgbm_model.onnx")

ort_session = ort.InferenceSession(model.SerializeToString())
# measure the time
start_time = time.time()
outputs = ort_session.run(None, {"float_input": x})
end_time = time.time()
print(outputs)
print(f"onnxruntime_c++ taken: {(end_time - start_time) * 1000:.3f} milliseconds")

Environment info

GPU: NVIDIA A100 PCIe 80GB
NVIDIA-SMI 525.89.02
Driver Version: 525.89.02
CUDA Version: 12.1 
CPU: Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
lightgbm                          4.5.0.99
onnx                              1.15.0
onnx-graphsurgeon                 0.3.27
onnxconverter-common              1.14.0
onnxmltools                       1.12.0
onnxruntime                       1.16.3
skl2onnx                          1.17.0

Command(s) I used to install LightGBM

git clone --recursive https://github.com/microsoft/LightGBM
cd LightGBM
cmake -B build -S . -DUSE_CUDA=1 -G 'Ninja'
sh ./build-python.sh install --cuda
python test_lgb.py

Additional Comments

I am particularly interested in understanding the performance trade-offs between CPU and GPU backends in both training and inference stages. Any insights or documentation on this topic would be greatly appreciated.

jameslamb commented 1 week ago

Thanks for using LightGBM.

LightGBM does not currently have GPU-accelerated inference. You can see https://github.com/microsoft/LightGBM/issues/5854#issuecomment-2138659914 for some other options to try using GPUs to generate predictions with a LightGBM model.
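
For example, sticking with the ONNX route from the reproducible example above, ONNX Runtime can be asked for its CUDA execution provider. This is only a minimal sketch: it assumes the onnxruntime-gpu build is installed and reuses the lightgbm_model.onnx file produced above, and the ai.onnx.ml tree-ensemble operators may still be assigned to the CPU if the CUDA provider has no kernel for them.

import numpy as np
import onnxruntime as ort

# Same sample input as in the script above (3 rows, 5 features).
x = (np.arange(15).reshape((-1, 5)) - 5).astype(np.float32) / 5

# Request the CUDA execution provider first, falling back to the CPU provider.
session = ort.InferenceSession(
    "lightgbm_model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers were actually loaded

outputs = session.run(None, {"float_input": x})
print(outputs)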

Is there any available documentation that offers a comprehensive explanation of CUDA acceleration for LightGBM?

What does "comprehensive explanation" mean to you? Is there another library that has something like what you're looking for, and if so can you link to that?

I am particularly interested in understanding the performance trade-offs between CPU and GPU backends in both training and inference stages.

If that is true, you should try reducing your benchmarking code to just lightgbm and numpy / scipy ... removing all those ONNX libraries in the middle. Otherwise, it'll be difficult to understand the difference between performance characteristics of "LightGBM" and of "LightGBM used in a specific way via some ONNX libraries".
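
For illustration, a stripped-down benchmark of that kind, using only lightgbm and numpy and mirroring the reproducible example above, could look like the sketch below. device="cuda" assumes a CUDA-enabled build like the one described in this issue; predict() runs on the CPU in either case.

import time

import lightgbm as lgb
import numpy as np

# Same shape of synthetic data as in the reproducible example above.
rng = np.random.default_rng(42)
X = rng.random((100, 5), dtype=np.float32)
y = rng.integers(0, 2, size=100)

params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "boosting_type": "gbdt",
    "verbose": -1,
    "device": "cuda",  # training device; predict() runs on the CPU either way
}
booster = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)

x = (np.arange(15).reshape((-1, 5)) - 5).astype(np.float32) / 5

start = time.perf_counter()
preds = booster.predict(x)
print(f"predict took {(time.perf_counter() - start) * 1000:.3f} ms")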

jameslamb commented 1 week ago

the GPU performance during training is notably lower than that of the CPU,

Two more points on claims like this:

  1. If you want someone to help you investigate this, please be much more specific than "performance is lower"... is that runtime? % utilization of the processor? were the models trained roughly identical?
  2. testing such things on a dataset with 100 rows and 5 columns is very unlikely to generate any generalizable findings. That is simply too small to explore the performance differences of different implementations. You should try larger and more complex datasets like some of those mentioned at https://github.com/microsoft/LightGBM/blob/c9d1ac7beac4426c8e636a392bde0f995d1ae8fb/docs/GPU-Performance.rst#performance-comparison (a sketch of such a comparison follows below).
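
A minimal sketch of such a comparison on a larger synthetic dataset (sizes chosen arbitrarily here, not one of the benchmark datasets from that document; device="cuda" again assumes a CUDA-enabled build):

import time

import lightgbm as lgb
import numpy as np
from sklearn.datasets import make_classification

# A much larger synthetic dataset than the 100 x 5 example above.
X, y = make_classification(n_samples=1_000_000, n_features=50,
                           n_informative=30, random_state=42)
X = X.astype(np.float32)

for device in ("cpu", "cuda"):
    params = {
        "objective": "binary",
        "metric": "binary_logloss",
        "verbose": -1,
        "device": device,
    }
    start = time.perf_counter()
    lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)
    print(f"device={device}: training took {time.perf_counter() - start:.1f} s")
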
unbelievable3513 commented 1 week ago

@jameslamb, thank you very much for your detailed response. I have gained valuable insights from your explanation:

  1. I am now aware of two engineering options for accelerating LightGBM model inference: Treelite and FIL. I will look into both to understand their capabilities and potential benefits.
  2. I apologize for the vague phrase "performance is lower," which may have caused confusion. What I meant are the training times shown in the screenshots: about 600 ms for lgb_cuda versus about 50 ms for lgb_cpu. Both runs used the same model, defined by model = lgb.train(params, train_data, num_boost_round=100), with the only difference being device = "cuda" for the former and device = "cpu" for the latter. Your explanation that the dataset is simply too small makes sense, and I will look for opportunities to experiment at a larger scale. Since training performance is not my primary focus, no further help is needed on that point; the GPU-Performance.rst document you shared, however, is highly informative and will be of significant value.
  3. On a related note, I am curious why there is no official GPU-accelerated inference. Is that a deliberate choice, for example because CPU inference is already fast enough for most scenarios, or is the repository's focus primarily on training optimization?

Thank you once again for your time and expertise.

jameslamb commented 1 week ago

the absence of an official inference acceleration version

There is just far more work to be done in this repo than people around to do it. @shiyu1994 has done most of the CUDA development in this project, maybe he can explain why training was a higher priority. I have some ideas about this but I'm not confident in them, and I don't want to misinform you.