microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

CUDA: multi GPUs issue #3450

Closed Brian0906 closed 3 years ago

Brian0906 commented 4 years ago

I'm trying to use multiple GPUs to train the model. When I increase the amount of data, this issue happens.

Everything works fine if the size of the training set is less than 10,000.

Operating System: Linux

CPU/GPU model: GPU

(screenshot of the error output attached in the original issue)

StrikerRUS commented 4 years ago

@Brian0906 Thanks a lot for using the experimental CUDA implementation! I observe the same error even with 1 GPU executing simple_example.py: https://github.com/microsoft/LightGBM/pull/3424#issuecomment-702189313.

Brian0906 commented 4 years ago

Hi @StrikerRUS, what's the size of the training set in simple_example.py? I found that this issue happens only when the dataset is large.

StrikerRUS commented 4 years ago

The dataset in the example is very small: 7000x28 https://github.com/microsoft/LightGBM/blob/master/examples/regression/regression.train

austinpagan commented 3 years ago

Hi @StrikerRUS and @Brian0906, is this the current issue for this problem? I'm one of the members of the team at IBM that ported this CUDA code, and I'm ready to try to reproduce this problem in my environment if someone can teach me how, preferably with the simplest possible dataset.

My plan would be to fix this problem with the simplest possible dataset and then see if that fixes it in the original environment.

StrikerRUS commented 3 years ago

Hello @austinpagan !

Please refer to https://github.com/microsoft/LightGBM/pull/3428#issuecomment-747676987 for the self-contained repro via Docker. Please let me know if you need any additional details.

austinpagan commented 3 years ago

Let me apologize, @StrikerRUS, if my questions are seen as somehow inappropriate, as I'm rather new to the open source environment...

OK, so three things I'd like to understand, please:

(1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario?
(2) Is this problem only seen within Docker containers?
(3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system. I know it's only happening when you run on CUDA, but we'd still like to understand the environment.

StrikerRUS commented 3 years ago

@austinpagan No need to apologize! Let me try to be more precise and do my best to answer your questions.

(2) Is this problem only seen within Docker containers?

No, this error can be reproduced with and without Docker. But I believe Docker is the easiest way to reproduce the error on your side, as it ensures we are using the same environment.

(3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system.

We don't test on Power systems, so we can only be 100% sure that X86 systems are affected.

(1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario?

  0. Get a machine (preferably x86, because we cannot guarantee that the bug reproduces on Power machines) with an NVIDIA GPU (we've tested with Tesla M60 and Tesla P100, but I don't think it matters).
  1. Install Docker and NVIDIA Docker (nvidia-docker2) on your machine. https://docs.docker.com/engine/install/ubuntu/ and https://github.com/NVIDIA/nvidia-docker#getting-started can help with this.
  2. Run the following command in your console to get the latest sources of LightGBM:
    git clone --recursive https://github.com/microsoft/LightGBM
  3. Set an environment variable named GITHUB_WORKSPACE to the path where you've downloaded the LightGBM repository at step #2. It will be something like export GITHUB_WORKSPACE=/home/yourUserName/Documents/LightGBM.
  4. Run the following commands in your console:
    export ROOT_DOCKER_FOLDER=/LightGBM
    cat > docker.env <<EOF
    TASK=cuda
    COMPILER=gcc
    GITHUB_ACTIONS=true
    OS_NAME=linux
    BUILD_DIRECTORY=$ROOT_DOCKER_FOLDER
    CONDA_ENV=test-env
    PYTHON_VERSION=3.8
    EOF
    cat > docker-script.sh <<EOF
    export CONDA=\$HOME/miniconda
    export PATH=\$CONDA/bin:\$PATH
    nvidia-smi
    $ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
    $ROOT_DOCKER_FOLDER/.ci/test.sh
    source activate \$CONDA_ENV
    cd \$BUILD_DIRECTORY/examples/python-guide/
    python simple_example.py
    EOF
    sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh

    This will run simple_example.py inside NVIDIA Docker and let you reproduce the error.

Please feel free to ping me if something is still not clear or if you face any errors while preparing the repro.

austinpagan commented 3 years ago

So, since we're not conveniently set up with X86 boxes here, I decided to at least try to see if I could reproduce the problem on a Power system (since, after all, we did this exercise largely to allow folks on Power to access the GPUs, and did not contemplate that X86 folks would experiment with moving from OpenCL to direct CUDA).

INSIDE my docker container on my power box, I just ran the sample and the output looked like this:

(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[1] valid_0's l2: 0.244076  valid_0's l1: 0.493018
Training until validation scores don't improve for 5 rounds
[2] valid_0's l2: 0.240297  valid_0's l1: 0.489056
[3] valid_0's l2: 0.235733  valid_0's l1: 0.484089
[4] valid_0's l2: 0.231352  valid_0's l1: 0.479088
[5] valid_0's l2: 0.228939  valid_0's l1: 0.476159
[6] valid_0's l2: 0.22593   valid_0's l1: 0.472664
[7] valid_0's l2: 0.222515  valid_0's l1: 0.468425
[8] valid_0's l2: 0.219569  valid_0's l1: 0.464594
[9] valid_0's l2: 0.2168    valid_0's l1: 0.460795
[10]    valid_0's l2: 0.214371  valid_0's l1: 0.457276
[11]    valid_0's l2: 0.211988  valid_0's l1: 0.453923
[12]    valid_0's l2: 0.210264  valid_0's l1: 0.451235
[13]    valid_0's l2: 0.208926  valid_0's l1: 0.448992
[14]    valid_0's l2: 0.207403  valid_0's l1: 0.44634
[15]    valid_0's l2: 0.20601   valid_0's l1: 0.444016
[16]    valid_0's l2: 0.204447  valid_0's l1: 0.441362
[17]    valid_0's l2: 0.202712  valid_0's l1: 0.43891
[18]    valid_0's l2: 0.201066  valid_0's l1: 0.436192
[19]    valid_0's l2: 0.1998    valid_0's l1: 0.433884
[20]    valid_0's l2: 0.198063  valid_0's l1: 0.431129
Did not meet early stopping. Best iteration is:
[20]    valid_0's l2: 0.198063  valid_0's l1: 0.431129
Saving model...
Starting predicting...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
The rmse of prediction is: 0.4450426449744025
(base) [root@58814263a195 python-guide]# 

That message about "double precision calculations" is telling me we are using our code. Is this a good result, or is there an error here?

I also wanted to try a raw run on a lightgbm repository completely outside of the Docker universe, so on a different Power box, I cloned the repository and did the following commands:

cd LightGBM
mkdir build ; cd build
cmake ..
make -j4

That all seemed to work, so I went into the directory with the program and ran it. It gave me the following fundamental error:

[fossum@rain6p1 python-guide]$ pwd
/home/fossum/LightGBM/examples/python-guide
[fossum@rain6p1 python-guide]$ python3.8 simple_example.py
Traceback (most recent call last):
  File "simple_example.py", line 2, in <module>
    import lightgbm as lgb
ModuleNotFoundError: No module named 'lightgbm'
[fossum@rain6p1 python-guide]$ 

I naively went back to the LightGBM directory and tried "make install", but that was a non-starter.

Not being a python expert, I figured I'd stop here and report my status, so maybe you could give me some pointers...

StrikerRUS commented 3 years ago

@austinpagan Am I right that you got a successful run of the simple_example.py script by following my guide from https://github.com/microsoft/LightGBM/issues/3450#issuecomment-754327830, but without step #0?

That message about "double precision calculations" is telling me we are using our code.

What do you mean by "our code"? The CUDA implementation your team contributed to the LightGBM repository, or some internal code from a fork?

austinpagan commented 3 years ago

Easy answer first: "our code" means the CUDA implementation our team contributed to the LightGBM repository. These warnings are only printed out when you run the code requesting the "cuda" device (as opposed to the OpenCL "gpu" device).

Yes, I ran "simple_example.py" following your guide, but skipping both steps 0 and 1, because we already have some Power boxes with functional docker containers, which already contained relatively recent clones of LightGBM, so I just went into one of them, and executed the "simple_example.py" program.

So, again, if you could help us figure out how to get the not-inside-a-container version running, we can hope to see the error there, and I can work on it.

Failing that, my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out. I could imagine this becoming an iterative process, and after a few iterations, we can determine why it's not working in your environment.

StrikerRUS commented 3 years ago

Thanks for your prompt response!

which already contained relatively recent clones of LightGBM

Could you be more precise and tell me which commit your local LightGBM version was compiled from? You can check it by running

git rev-parse HEAD

inside your local clone of the repo. Before taking any further steps, we should agree on the version we will debug with; continuing with different versions of the source files would make the whole debugging process pointless.

austinpagan commented 3 years ago

Fortunately for both of us, I'm a morning person. With the nine-hour time difference between Moscow and Austin, me being at my computer at 3 PM your time should improve our productivity. To the extent that you can work a bit into your evening, that helps as well!

(base) [root@58814263a195 LightGBM]# pwd
/home/builder/fossum/LightGBM
(base) [root@58814263a195 LightGBM]# git rev-parse HEAD
5d79ff20d1b7ae226531e2445b17d747b253a637
(base) [root@58814263a195 LightGBM]# 

Now, if you want me to clone a fresh version of your choosing and try there, that will be fine, but you'll have to walk me through the process of building it to the point where my attempt to run the Python test doesn't fail as I indicated above on my other box. (My strengths are algorithms, debugging, and C coding, not building and installing.)

austinpagan commented 3 years ago

I hope it's OK that we're more used to doing our work inside the docker container rather than issuing commands to the container from outside...

StrikerRUS commented 3 years ago

Now, if you want me to clone a fresh version of your choosing and try there, that will be fine,

No thanks, I believe that 5d79ff20d1b7ae226531e2445b17d747b253a637 is a good candidate for the debugging! Let's continue with this commit.

Given that the simple code runs OK on a POWER machine but fails on many x86 ones, it is starting to look like the bug affects only the x86 architecture. However, that is quite strange, because we are talking about CUDA code executing on NVIDIA cards here...

I think we can follow your suggestion

my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out.

Let me compile LightGBM at the commit we agreed on and produce the most verbose logs. Then I think you can suggest some debug code injections, and I'll recompile with them and get back with more info. I guess this will be the most efficient form of collaboration, given that we do not have easy access to POWER machines and you do not have easy access to x86 ones. Please let me know WDYT.

austinpagan commented 3 years ago

I am happy with this plan!

I have a recommendation. If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...

Also, I will just let you know that my plan would be to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file, but let's see what your log reports have to say.
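
As a rough illustration of the kind of instrumentation described above (an editorial sketch, not code from this thread; the LOG_CUDA macro name is invented), one way to log every guarded CUDA memory call and its result looks like this:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical debug wrapper (name invented for this sketch): print the exact
// call text and the CUDA result code for every guarded allocation/copy call.
#define LOG_CUDA(call)                                                           \
  do {                                                                           \
    cudaError_t err_ = (call);                                                   \
    std::fprintf(stderr, "[debug] %s -> %s\n", #call, cudaGetErrorString(err_)); \
  } while (0)

int main() {
  void* pinned = nullptr;
  // Log a pinned host allocation and its release, mirroring the kind of
  // CUDA memory calls the CUDA tree learner makes.
  LOG_CUDA(cudaHostAlloc(&pinned, 1024, cudaHostAllocPortable));
  LOG_CUDA(cudaFreeHost(pinned));
  return 0;
}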

austinpagan commented 3 years ago

Two more things.
(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?
(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...

StrikerRUS commented 3 years ago

OK, I have set up a fresh and minimal environment to start the debugging process.

If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...

What variable do you mean? I run a bash script inside Docker. It's common practice to ask Docker to run something; that can't be the problem. More evidence comes from other reports of the same error: I believe the users who reported them use quite different scripts and maybe don't use Docker at all, and they certainly don't use any of the variables that I use.

(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?

Yeah, that's why I asked you to set up a clean Docker environment. I suspected that you had some other version of LightGBM that works fine on your side, and now I'm quite confident of that. The thing is that the commit you told me your version of LightGBM is compiled from simply cannot be compiled. CMake reports the following error.

...
[ 77%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/data_parallel_tree_learner.cpp.o
/LightGBM/src/treelearner/cuda_tree_learner.cpp: In member function 'LightGBM::Tree* LightGBM::CUDATreeLearner::Train(const score_t*, const score_t*)':
/LightGBM/src/treelearner/cuda_tree_learner.cpp:538:59: error: no matching function for call to 'LightGBM::CUDATreeLearner::Train(const score_t*&, const score_t*&)'
  538 |   Tree *ret = SerialTreeLearner::Train(gradients, hessians);
      |                                                           ^
In file included from /LightGBM/src/treelearner/cuda_tree_learner.h:25,
                 from /LightGBM/src/treelearner/cuda_tree_learner.cpp:6:
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note: candidate: 'virtual LightGBM::Tree* LightGBM::SerialTreeLearner::Train(const score_t*, const score_t*, bool)'
   78 |   Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
      |         ^~~~~
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note:   candidate expects 3 arguments, 2 provided
[ 80%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
make[3]: *** [CMakeFiles/_lightgbm.dir/build.make:407: CMakeFiles/_lightgbm.dir/src/treelearner/cuda_tree_learner.cpp.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/Makefile2:304: CMakeFiles/_lightgbm.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:311: CMakeFiles/_lightgbm.dir/rule] Error 2
make: *** [Makefile:274: _lightgbm] Error 2

This happens due to the following recent change in the LightGBM codebase: fcfd4132e6d40a22d52023396329c41fd3de4a42 (but those changes came before the commit we agreed on). So you should rebuild LightGBM to match the commit you've specified (and confirm that compilation fails), or tell me another (older) commit that your LightGBM version is really built from.

However, I went ahead and fixed the error that prevented the library from compiling.

  1. fcdeb10535d340143f1fe54fc9785dbfd655469c
  2. 5eee55cc3e5a531e24530cbcd4f027a4b44ebcdd

These fixes allowed me to successfully compile the library with the commit you've mentioned (5d79ff20d1b7ae226531e2445b17d747b253a637).

Then I specified verbose=4 in simple_example.py to get debug logs from the C++ code, but unfortunately this didn't help. The error is still the same as before, with no additional info.

2021-01-07T15:06:02.5788235Z Loading data...
2021-01-07T15:06:02.5789446Z 
2021-01-07T15:06:02.5789792Z Starting training...
2021-01-07T15:06:02.5790650Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5791552Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T15:06:02.5792427Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5798769Z Traceback (most recent call last):
2021-01-07T15:06:02.5799483Z   File "simple_example.py", line 38, in <module>
2021-01-07T15:06:02.5799965Z     early_stopping_rounds=5)
2021-01-07T15:06:02.5801170Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
2021-01-07T15:06:02.5801839Z     booster = Booster(params=params, train_set=train_set)
2021-01-07T15:06:02.5802709Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 2076, in __init__
2021-01-07T15:06:02.5803309Z     ctypes.byref(self.handle)))
2021-01-07T15:06:02.5804122Z   File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 52, in _safe_call
2021-01-07T15:06:02.5805012Z     raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
2021-01-07T15:06:02.5811139Z lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T15:06:02.5811782Z 
2021-01-07T15:06:05.3524322Z ##[error]Process completed with exit code 1.

So I would really appreciate your suggestions for how

to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file

As for how to re-compile and reinstall LightGBM, it is quite simple.

Commands to compile the dynamic library: https://github.com/microsoft/LightGBM/blob/5eee55cc3e5a531e24530cbcd4f027a4b44ebcdd/.github/workflows/cuda.yml#L76-L80
Command to install the Python package with the just-compiled library: https://github.com/microsoft/LightGBM/blob/5eee55cc3e5a531e24530cbcd4f027a4b44ebcdd/.github/workflows/cuda.yml#L81

Here is the full script that is used to install and setup Docker, clone repository, install CMake, Python and so on: https://github.com/microsoft/LightGBM/blob/test_cuda/.github/workflows/cuda.yml

(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...

Thanks! I set up the same Python version (3.6) to mimic your environment.

austinpagan commented 3 years ago

Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...

austinpagan commented 3 years ago

Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:

(base) [root@58814263a195 treelearner]# sum cuda_tree_learner.cpp
36657    40
(base) [root@58814263a195 treelearner]# 
StrikerRUS commented 3 years ago

Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in

Thank you very much!

Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:

Have you applied those two fixes?

However, I went ahead and fixed the error which didn't allow to compile the library.

  1. fcdeb10
  2. 5eee55c
austinpagan commented 3 years ago

This may or may not end up being a "fix" if it helps, but it's useful information to have, and it's an easy change.

Please replace line 414 of src/treelearner/cuda_tree_learner.cpp with a different line, as follows:

Current line:

CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));

Suggested new line:

CUDASUCCESS_OR_FATAL(cudaMemcpy(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice));
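
A brief editorial note on why this swap is informative (the sketch below is mine, not from the thread; the check() helper and the buffer names are illustrative stand-ins): a blocking cudaMemcpy reports a failure of its own arguments right away, while errors from earlier asynchronous work queued on the same stream may only surface at a later call or synchronization point. So if the synchronous variant still fails with "invalid argument", the failing call really is this copy.

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper in the spirit of CUDASUCCESS_OR_FATAL: abort on error.
static void check(cudaError_t err, const char* what) {
  if (err != cudaSuccess) {
    std::fprintf(stderr, "[CUDA] %s failed: %s\n", what, cudaGetErrorString(err));
    std::exit(1);
  }
}

int main() {
  const size_t sizes_in_byte = 7000;  // stand-in for one dense feature group
  unsigned char* device_features = nullptr;
  unsigned char* tmp_data = nullptr;
  cudaStream_t stream_;

  check(cudaMalloc(&device_features, sizes_in_byte), "cudaMalloc");
  check(cudaHostAlloc(&tmp_data, sizes_in_byte, cudaHostAllocPortable), "cudaHostAlloc");
  check(cudaStreamCreate(&stream_), "cudaStreamCreate");

  // Asynchronous variant: work is queued on the stream, so some failures may
  // only be observed at a later CUDA call or at a synchronization point.
  check(cudaMemcpyAsync(device_features, tmp_data, sizes_in_byte,
                        cudaMemcpyHostToDevice, stream_), "cudaMemcpyAsync");
  check(cudaStreamSynchronize(stream_), "cudaStreamSynchronize");

  // Blocking variant: if this one reports "invalid argument", the problem is
  // in this copy's own pointers/size, not deferred from earlier stream work.
  check(cudaMemcpy(device_features, tmp_data, sizes_in_byte,
                   cudaMemcpyHostToDevice), "cudaMemcpy");

  cudaFree(device_features);
  cudaFreeHost(tmp_data);
  cudaStreamDestroy(stream_);
  return 0;
}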
StrikerRUS commented 3 years ago

Have you applied those two fixes?

Here is what I'm getting after the patch:

Check sum of cuda_tree_learner.cpp
15848    40
austinpagan commented 3 years ago

Sorry, I don't know how to "apply" a fix.

austinpagan commented 3 years ago

Oh, never mind. I see now. Give me a couple minutes.

StrikerRUS commented 3 years ago

Suggested new line:

CUDASUCCESS_OR_FATAL(cudaMemcpy(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice));

Done! Building...

Sorry, I don't know how to "apply" a fix.

Simply change lines #537-538 in file src/treelearner/cuda_tree_learner.cpp from

Tree* CUDATreeLearner::Train(const score_t* gradients, const score_t *hessians) {
  Tree *ret = SerialTreeLearner::Train(gradients, hessians);

to

Tree* CUDATreeLearner::Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) {
  Tree *ret = SerialTreeLearner::Train(gradients, hessians, is_first_tree);

and line #48 in file src/treelearner/cuda_tree_learner.h from

    Tree* Train(const score_t* gradients, const score_t *hessians);

to

    Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree);
StrikerRUS commented 3 years ago

Done! Building...

Built! Unfortunately, no changes...

I'm afraid I can't get more debug info due to the following issue of ours: #3641.

austinpagan commented 3 years ago

And, on my end, I can't get this code to build. So frustrating...

austinpagan commented 3 years ago

so, when you say "unfortunately, no changes" you mean the error reported is exactly the same, even with the change I proposed? That would be good news, because it means the error is actually in THAT CALL, and not in some previous call in the same "async thread"...

austinpagan commented 3 years ago

still claiming the problem is in line 414, right?

StrikerRUS commented 3 years ago

And, on my end, I can't get this code to build. So frustrating...

"this code" = code with these fixes https://github.com/microsoft/LightGBM/issues/3450#issuecomment-756209798?

Maybe you don't have all the source files? Could you please try to re-clone the repo and only after that apply the fixes?

git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
git checkout 5d79ff20d1b7ae226531e2445b17d747b253a637

<apply fixes to src/treelearner/cuda_tree_learner.h and src/treelearner/cuda_tree_learner.cpp>

so, when you say "unfortunately, no changes" you mean the error reported is exactly the same, even with the change I proposed?

Yes, absolutely right.

still claiming the problem is in line 414, right?

I guess so. At least the error comes from line 414...

austinpagan commented 3 years ago

so when I try to build, it's trying to get files from the "external_libs" directory, but in my clone, that directory just contains two empty sub-directories... any idea whether I'm missing some piece of the build that populates those directories? It looks like there's a "setup.py" file that mentions this directory, but I don't know who is supposed to execute that setup command...

austinpagan commented 3 years ago

We are investigating, but I figured it wouldn't hurt to ask you if you just know the answer off the top of your head...

StrikerRUS commented 3 years ago

that directory just contains two empty sub-directories...

Please make sure you don't forget the --recursive flag when cloning the repo.

git clone --recursive https://github.com/microsoft/LightGBM.git
StrikerRUS commented 3 years ago

I've tried it and can confirm that we can reproduce the error with a simple command-line program. I simplified the reproducible example so that it no longer requires a Python installation. I believe it will help us sync environments.

Fortunately, the error is still the same, but we no longer need the Python layer as a proxy. Now we run the simple regression example from the repository directly via the CLI version of LightGBM; previously we ran it via our Python package.

Please take a look at the greatly simplified script (no Python, no env variables) we run inside Docker to reproduce the error: https://github.com/microsoft/LightGBM/blob/bcc3f291c8470bd680aa0c332cfaa3b1a0d01bdd/.github/workflows/cuda.yml#L43-L62

StrikerRUS commented 3 years ago

And here are more verbose logs from the run after applying your proposed change in line 414 of the src/treelearner/cuda_tree_learner.cpp file (https://github.com/microsoft/LightGBM/issues/3450#issuecomment-756205729):

2021-01-07T18:54:57.1318861Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T18:54:57.1320390Z [LightGBM] [Info] Finished loading parameters
2021-01-07T18:54:57.1320991Z [LightGBM] [Debug] Loading train file...
2021-01-07T18:54:57.1405940Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:57.1597220Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T18:54:58.2787014Z [LightGBM] [Debug] Loading validation file #1...
2021-01-07T18:54:58.2879002Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:58.2964932Z [LightGBM] [Info] Finished loading data in 1.165807 seconds
2021-01-07T18:54:58.2965532Z [LightGBM] [Info] LightGBM using CUDA trainer with DP float!!
2021-01-07T18:54:58.2971585Z [LightGBM] [Info] Total Bins 6132
2021-01-07T18:54:58.2981032Z [LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
2021-01-07T18:54:58.2981689Z [LightGBM] [Debug] device_bin_size_ = 256
2021-01-07T18:54:58.2982161Z [LightGBM] [Debug] Resized feature masks
2021-01-07T18:54:58.2982684Z [LightGBM] [Debug] Memset pinned_feature_masks_
2021-01-07T18:54:58.2983679Z [LightGBM] [Debug] Allocated device_features_ addr=0x7ff5aaa00000 sz=196000
2021-01-07T18:54:58.2985727Z [LightGBM] [Debug] Memset device_data_indices_
2021-01-07T18:54:58.2991002Z [LightGBM] [Fatal] [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.2995493Z [LightGBM] [Debug] created device_subhistograms_: 0x7ff5ab000000
2021-01-07T18:54:58.3027139Z 
2021-01-07T18:54:58.3027684Z [LightGBM] [Debug] Started copying dense features from CPU to GPU
2021-01-07T18:54:58.3028247Z Met Exceptions:
2021-01-07T18:54:58.3028802Z [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.3029237Z 
2021-01-07T18:54:58.3030255Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 1
2021-01-07T18:54:58.3031103Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3031917Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3032773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3033581Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3034408Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3035216Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3036038Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3036843Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3037660Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3038459Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3039263Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3040077Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3041108Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3041993Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3042794Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3043607Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3044405Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3045225Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3046029Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3046847Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3047646Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3048447Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3049264Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3050082Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3050902Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3051702Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3052521Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3053318Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3054138Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3054939Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3055754Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3056550Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3057347Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3058351Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3059161Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3059976Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3060773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3061589Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3062382Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:59.6885338Z ##[error]Process completed with exit code 255.

Hope they help somehow. Please let me know how I can further modify the source code of the CUDA tree learner to get useful info that will help narrow down the problem.

austinpagan commented 3 years ago

So, sorry for the delay in response. My colleague seems to be close to figuring out how we can reproduce this problem on Power systems. You can rest easy for now, because if he is successful, we can handle it from here on out...

StrikerRUS commented 3 years ago

Oh, great news! Thank you very much!

austinpagan commented 3 years ago

And, it is confirmed. On my power system, I get this now:

(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Fatal] [CUDA] invalid argument /home/builder/fossum/LightGBM/src/treelearner/cuda_tree_learner.cpp 414

Traceback (most recent call last):
  File "simple_example.py", line 39, in <module>
    early_stopping_rounds=5)
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
    booster = Booster(params=params, train_set=train_set)
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 2076, in __init__
    ctypes.byref(self.handle)))
  File "/opt/anaconda3/lib/python3.6/site-packages/lightgbm/basic.py", line 52, in _safe_call
    raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
lightgbm.basic.LightGBMError: [CUDA] invalid argument /home/builder/fossum/LightGBM/src/treelearner/cuda_tree_learner.cpp 414

(base) [root@58814263a195 python-guide]#

austinpagan commented 3 years ago

So, again, I can pursue this now, without pestering you. Wish me luck!

StrikerRUS commented 3 years ago

Hah, in any other situation people shouldn't be happy when someone else gets errors from software, but right now I'm happy! 😄 I hope it won't be hard for you to find the root cause.

Again, if you are not comfortable using Python, please check my message above where I show how to reproduce the same error with LightGBM's executable binary from the command-line interface. Feel free to ask for any details if something is not clear.

ChipKerchner commented 3 years ago

@StrikerRUS The problem is that the non-CUDA vector allocators were changed to use kAlignedSize with VirtualFileWriter::AlignedSize between 3.0 and 3.1. Therefore the CUDA vector allocator wasn't allocating enough space in some instances. Here is a proposed change to fix the CUDA vector allocator. simple_example.py and advanced_example.py work with this change.

diff --git a/include/LightGBM/cuda/vector_cudahost.h b/include/LightGBM/cuda/vector_cudahost.h
index 03db338..46698d0 100644
--- a/include/LightGBM/cuda/vector_cudahost.h
+++ b/include/LightGBM/cuda/vector_cudahost.h
@@ -42,6 +42,7 @@ struct CHAllocator {
   T* allocate(std::size_t n) {
     T* ptr;
     if (n == 0) return NULL;
+    n = (n + kAlignedSize - 1) & -kAlignedSize;
     #ifdef USE_CUDA
       if (LGBM_config_::current_device == lgbm_device_cuda) {
         cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
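
To make the intent of the added line concrete, here is a small standalone sketch (mine, not part of the patch) of the power-of-two round-up it performs; the value 32 used for kAlignedSize below is illustrative only, the real constant comes from the LightGBM headers:

#include <cassert>
#include <cstddef>
#include <cstdio>

// Illustrative stand-in for LightGBM's alignment constant (assumed power of two).
constexpr std::size_t kAlignedSize = 32;

// Round n up to the next multiple of kAlignedSize; for unsigned power-of-two
// values, (n + k - 1) & -k (as in the patch) equals (n + k - 1) & ~(k - 1).
std::size_t AlignUp(std::size_t n) {
  return (n + kAlignedSize - 1) & ~(kAlignedSize - 1);
}

int main() {
  for (std::size_t n : {1, 31, 32, 33, 100}) {
    std::printf("n=%zu -> aligned=%zu\n", n, AlignUp(n));
  }
  // The point of the fix: if the regular (non-CUDA) allocators pad requests up
  // to kAlignedSize but the cudaHostAlloc-based allocator does not, the pinned
  // buffers can end up smaller than the rest of the code assumes.
  assert(AlignUp(33) == 64);
  return 0;
}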
StrikerRUS commented 3 years ago

@austinpagan @ChipKerchner Awesome! I can confirm that this fix gets rid of the errors on X86 machines as well. Many thanks for the research you've done and for providing the fix!

Would you like to contribute this fix from your account so that GitHub will associate the fixing commit with you? Or, if it's not very important to you, would you prefer to let one of the LightGBM maintainers do it to save your time?

StrikerRUS commented 3 years ago

Fixed via #3748.

ChipKerchner commented 3 years ago

@StrikerRUS This should fix the remaining CUDA failures. Let me know if you see any issues.

diff --git a/src/treelearner/cuda_tree_learner.cpp b/src/treelearner/cuda_tree_learner.cpp
index 16569ee..4495578 100644
--- a/src/treelearner/cuda_tree_learner.cpp
+++ b/src/treelearner/cuda_tree_learner.cpp
@@ -408,7 +408,7 @@ void CUDATreeLearner::copyDenseFeature() {
     // looking for dword_features_ non-sparse feature-groups
     if (!train_data_->IsMultiGroup(i)) {
       dense_feature_group_map_.push_back(i);
-      auto sizes_in_byte = train_data_->FeatureGroupSizesInByte(i);
+      auto sizes_in_byte = std::min(train_data_->FeatureGroupSizesInByte(i), static_cast<size_t>(num_data_));
       void* tmp_data = train_data_->FeatureGroupData(i);
       Log::Debug("Started copying dense features from CPU to GPU - 2");
       CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));
@@ -534,8 +534,8 @@ void CUDATreeLearner::InitGPU(int num_gpu) {
   copyDenseFeature();
 }
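
My reading of the clamp above, stated as an assumption rather than a fact from the thread: after the allocator fix, FeatureGroupSizesInByte can reflect a padded host allocation, while the destination slice in device_features is num_data_ bytes per dense feature group, so the copy length has to be capped. A tiny sketch with made-up numbers (7000 rows as in the regression example, 7008 as a hypothetical padded size):

#include <algorithm>
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t num_data_ = 7000;           // rows; one byte per row for an 8-bit bin group
  const std::size_t group_size_in_byte = 7008;  // hypothetical aligned (padded) host size

  // Mirror the std::min(...) clamp from the diff: never copy more bytes than
  // the per-group slice of the device buffer can hold.
  const std::size_t sizes_in_byte = std::min(group_size_in_byte, num_data_);

  std::printf("copying %zu of %zu host bytes into a %zu-byte device slice\n",
              sizes_in_byte, group_size_in_byte, num_data_);
  return 0;
}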
StrikerRUS commented 3 years ago

@ChipKerchner After applying this fix, all but two tests pass! Very nice indeed!

The failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same graphviz environment issue as in https://github.com/microsoft/LightGBM/pull/3672#issuecomment-757642931.

============================= test session starts ==============================
platform linux -- Python 3.8.2, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /LightGBM
collected 238 items

../tests/c_api_test/test_.py ..                                          [  0%]
../tests/python_package_test/test_basic.py .............                 [  6%]
../tests/python_package_test/test_consistency.py ......                  [  8%]
../tests/python_package_test/test_dask.py ............................   [ 20%]
../tests/python_package_test/test_dual.py s                              [ 21%]
../tests/python_package_test/test_engine.py ............................ [ 32%]
.......................................                                  [ 49%]
../tests/python_package_test/test_plotting.py F...F                      [ 51%]
../tests/python_package_test/test_sklearn.py ........................... [ 62%]
......x.........................................x....................... [ 92%]
.................                                                        [100%]
= 2 failed, 233 passed, 1 skipped, 2 xfailed, 74 warnings in 195.32s (0:03:15) =
ChipKerchner commented 3 years ago

The failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same graphviz environment issue as in #3672 (comment).

============================= test session starts ==============================
platform linux -- Python 3.8.2, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /LightGBM
collected 238 items

../tests/python_package_test/test_plotting.py F...F                      [ 51%]

In my branch, test_plotting passes all tests.

python -m unittest tests/python_package_test/test_plotting.py
.../test_plotting.py:156: UserWarning: More than one metric available, picking one to plot.
  ax0 = lgb.plot_metric(evals_result0)
..s
----------------------------------------------------------------------
Ran 5 tests in 1.956s

OK (skipped=2)
jameslamb commented 3 years ago

Woo! Thanks @ChipKerchner. Like @StrikerRUS mentioned, I think it's very, very unlikely that the two failing plotting tests are related to your changes. I found in https://github.com/microsoft/LightGBM/pull/3672#issuecomment-757642931 that there might be some issues with the conda-forge recipe for graphviz.

StrikerRUS commented 3 years ago

Yeah, thanks for the info about the tests, @ChipKerchner! I'm 100% sure that the 2 failing plotting tests on our side are related to our environment, and I'll fix this environment issue while working on making CUDA builds run on a regular basis.

austinpagan commented 3 years ago

@StrikerRUS: Look at you, making all our dreams come true!!! Thank you!

StrikerRUS commented 3 years ago

@austinpagan Thanks a lot for all your hard work!