[python-package] [cuda] LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0)

NisuSan commented 4 months ago

Description

Running the code fails with the following error:

LightGBMError: Check failed: (split_indices_block_size_data_partition) > (0) at /usr/local/src/lightgbm/LightGBM/lightgbm-python/src/treelearner/cuda/cuda_data_partition.cpp, line 280 .

Reproducible example

Please, use this repo

Environment info

Docker image based on nvidia/cuda:12.2.0-devel-ubuntu22.04
GPU: GeForce GTX 1060
CPU: AMD Ryzen 5 1600

Additional Comments

The same message appears for both classification and regression models.

jameslamb commented 4 months ago

Thanks for using LightGBM.

The repo you've linked shows how you installed LightGBM, but not the code you're using to run LightGBM. Can you please share the exact code you're using? Are you able to provide a reproducible example (exact code that we could run which replicates the error)?

Although you didn't say it here, I know you're using the Python package specifically because of your comments on #6325. See these links for some examples of good reproducible examples in Python: https://github.com/microsoft/LightGBM/issues/6321#issuecomment-1948512259.

It's going to be very difficult to help you given only the details you've provided so far.

NisuSan commented 4 months ago

Can you please share the exact code you're using?

The repo I provided has a code snippet inside the README (see the attached screenshot, Screenshot 2024-02-19-3).

shiyu1994 commented 4 months ago

Thanks for reporting this issue. I think it should be quick to fix. I'm trying with your example.

shiyu1994 commented 4 months ago

An update on the progress here.

I've built the Docker image and tried to reproduce the error, but the code runs successfully within the Docker container. Here's the output. I modified the code to get the training loss.

import lightgbm as lgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=10_000)
dtrain = lgb.Dataset(X, label=y)
dval = lgb.Dataset(X, label=y)
bst = lgb.train(
    params={
        "objective": "regression",
        "device": "cuda",
        "verbose": 1,
        "metric": "l2",
    },
    train_set=dtrain,
    valid_sets=[dval],
    callbacks=[lgb.log_evaluation(period=1)],
    num_boost_round=5,
)

[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Info] Total Bins 25500
[LightGBM] [Info] Number of data points in the train set: 10000, number of used features: 100
[LightGBM] [Info] Start training from score -0.370042
[1]     valid_0's l2: 22190.7
[2]     valid_0's l2: 19590.2
[3]     valid_0's l2: 17420.6
[4]     valid_0's l2: 15575
[5]     valid_0's l2: 13972.9

My GPU is a V100, which is different from yours. One more thing I would like to confirm: the Dockerfile does not seem to specify a LightGBM version, so it builds the latest version from source. In the container where the error can be reproduced, can you provide the commit head of the LightGBM repo? That would help me further identify the root cause.
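
For reference, a minimal sketch of one way to collect that information inside the container. It assumes the source tree was cloned with git into /usr/local/src/lightgbm/LightGBM (inferred from the path in the error message, not confirmed) and that git is installed in the image. The helper names git_head and installed_version are made up for illustration.

```python
import subprocess
from importlib.metadata import PackageNotFoundError, version


def git_head(repo_dir):
    """Return the commit hash checked out in repo_dir, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # Missing directory, git not installed, or not a git checkout.
        return None
    return result.stdout.strip()


def installed_version(package="lightgbm"):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumed clone location, taken from the path in the error message.
    print("commit:", git_head("/usr/local/src/lightgbm/LightGBM"))
    print("version:", installed_version())
```

Both helpers return None instead of raising, so the script still prints something useful if the source directory was removed after the build.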

Thanks.

NisuSan commented 4 months ago

@shiyu1994, Thanks for your response!

can you provide the commit head of the LightGBM repo? That would be helpful for me to further identify the root cause.

Sure, it's https://github.com/microsoft/LightGBM/commit/252828fd86627d7405021c3377534d6a8239dd69

NisuSan commented 2 months ago

@shiyu1994, Do you have any information about the problem?
