microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

[GPU] GPU trainer takes more time than CPU's #6525

Closed: sbushmanov closed this issue 1 month ago

sbushmanov commented 2 months ago
Ubuntu 22.04
lightgbm                  4.4.0.99                 pypi_0    pypi

I always lived under the assumption that the GPU trainer is supposed to be faster than the CPU one.

However, the results below totally puzzled me:

%%timeit
params = {
    "objective": "multiclass",
    "num_class": 3,
    "metric": "multi_logloss",
    "device": "cpu", # <---
    "verbose": -1
}

model = lgb.train(
    params,
    train_set=dtrain,
    valid_sets=[dvalid],
    num_boost_round=3000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
    ],
)
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437579
503 ms ± 93.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

vs

%%timeit
params = {
    "objective": "multiclass",
    "num_class": 3,
    "metric": "multi_logloss",
    "device": "gpu",  # <---
    "verbose": -1
}

model = lgb.train(
    params,
    train_set=dtrain,
    valid_sets=[dvalid],
    num_boost_round=3000,
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),
    ],
)
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437412
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437464
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437521
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437449
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437117
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437426
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437294
Training until validation scores don't improve for 100 rounds
Early stopping, best iteration is:
[35]    valid_0's multi_logloss: 0.437533
1.67 s ± 151 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Can somebody enlighten me on why the GPU slowdown is happening?

jameslamb commented 1 month ago

@sbushmanov the "device": "gpu" version is not being well-maintained right now. See my summary here: https://github.com/microsoft/LightGBM/issues/4946#issuecomment-2126170630.

I suspect it could be slower than CPU because, for example, it still does a significant amount of work on the CPU and involves some copying of data between host and device memory.

Do you have NVIDIA GPUs? If so, could you please try the "device": "cuda" version instead? It's better maintained, does more work on the GPU, and should be much faster.

Since it seems you're using conda to manage dependencies... you could use the package from conda-forge to get CUDA support.

# remove pip-installed lightgbm
# (based on the conda output you shared, I think you have this)
pip uninstall --yes lightgbm

# install lightgbm
conda install -c conda-forge --yes 'lightgbm>=4.4.0'

Then just change from "device": "gpu" to "device": "cuda" in your code.
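
For reference, the change in your snippet would look roughly like this (a sketch reusing the dtrain / dvalid from your original post):

import lightgbm as lgb

params = {
    "objective": "multiclass",
    "num_class": 3,
    "metric": "multi_logloss",
    "device": "cuda",  # <--- was "gpu"
    "verbose": -1,
}

model = lgb.train(
    params,
    train_set=dtrain,
    valid_sets=[dvalid],
    num_boost_round=3000,
    callbacks=[lgb.early_stopping(stopping_rounds=100)],
)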

sbushmanov commented 1 month ago

@jameslamb

Thanks for taking your time!

"cuda" is even worse than "gpu".

The timings are roughly as follows: "cpu": 9 s, "gpu": 15 s, "cuda": 52 s.

The "cuda" version was installed with the included script: sh ./build-python.sh install --cuda

jameslamb commented 1 month ago

What versions of Python, CUDA, etc. are you using?

I noticed you have early stopping enabled... is it being triggered at the exact same iteration across device types? There can be small numerical differences across them. (Not using early stopping would be a fairer comparison of training time.)
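
For example, a rough sketch of a more even comparison (reusing the dtrain from your original post, with a fixed budget of 100 rounds):

params = {
    "objective": "multiclass",
    "num_class": 3,
    "metric": "multi_logloss",
    "device": "cpu",  # repeat with "gpu" and "cuda"
    "verbose": -1,
}

# no validation set and no early-stopping callback, so every device
# trains for exactly the same number of iterations
model = lgb.train(params, train_set=dtrain, num_boost_round=100)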

How is that time split between Dataset construction and the actual training? You could call .construct() on the Dataset before training to estimate those timings.
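
Something like this (a sketch, assuming the raw arrays are available as X_train / y_train and params as above) would separate the two:

import time
import lightgbm as lgb

dtrain = lgb.Dataset(X_train, label=y_train)

t0 = time.perf_counter()
dtrain.construct()  # builds the binned Dataset up front
t1 = time.perf_counter()
model = lgb.train(params, train_set=dtrain, num_boost_round=100)
t2 = time.perf_counter()

print(f"Dataset construction: {t1 - t0:.2f} s, training: {t2 - t1:.2f} s")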

Does the Dataset have categorical features? What are its dimensions (number of rows, number of columns)?

These reports about speed / timings are very difficult to investigate without such information and a reproducible example.

sbushmanov commented 1 month ago

@jameslamb

After experimenting with datasets of different sizes while keeping n_estimators constant, I found that on small datasets the CPU booster outperforms the CUDA one by an order of magnitude. However, with 10,000,000 rows and 100 features, the CUDA one is about 3x faster.

Thanks again for your time and support!

jameslamb commented 1 month ago

Ah interesting, thanks for that!

I think for some smaller datasets, it's definitely possible for the overhead introduced by copying between device (GPU) and host (CPU) memory to be large enough that it shows up as a longer absolute training time.
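
If anyone wants to see that crossover for themselves, here is a rough sketch (assuming a CUDA-enabled LightGBM build and scikit-learn for synthetic data; sizes are kept smaller than the 10,000,000-row case above just so it runs quickly, and exact timings will vary by hardware):

import time
import lightgbm as lgb
from sklearn.datasets import make_classification

for n_rows in (10_000, 1_000_000):
    # synthetic 3-class problem with 100 features
    X, y = make_classification(
        n_samples=n_rows, n_features=100, n_informative=20, n_classes=3
    )
    for device in ("cpu", "cuda"):
        # build the binned Dataset before timing so only training is measured
        dtrain = lgb.Dataset(X, label=y).construct()
        params = {
            "objective": "multiclass",
            "num_class": 3,
            "device": device,
            "verbose": -1,
        }
        t0 = time.perf_counter()
        lgb.train(params, train_set=dtrain, num_boost_round=100)
        print(f"rows={n_rows} device={device}: {time.perf_counter() - t0:.1f} s")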