Closed Brian0906 closed 3 years ago
@Brian0906 Thanks a lot for using the experimental CUDA implementation! I observe the same error even with 1 GPU executing simple_example.py: https://github.com/microsoft/LightGBM/pull/3424#issuecomment-702189313.
hi @StrikerRUS what's the size of the train set in simple_example.py? I found that this issue happens only if the dataset is large.
Dataset is very small in the example: 7000x28 https://github.com/microsoft/LightGBM/blob/master/examples/regression/regression.train
Hi, @StrikerRUS and @Brian0906 ... is this the current issue for this problem? I'm one of the members of the team at IBM that ported this CUDA code, and I'm ready to try to reproduce this problem in my environment, if someone can teach me how, preferably with the most simple possible dataset.
My plan would be to fix this problem with the simplest possible dataset and then see if that fixes it in the original environment.
Hello @austinpagan !
Please refer to https://github.com/microsoft/LightGBM/pull/3428#issuecomment-747676987 for the self-contained repro via Docker. Please let me know if you need any additional details.
Let me apologize, @StrikerRUS, if my questions are seen as somehow inappropriate, as I'm rather new to the open source environment...
OK, so three things I'd like to understand, please: (1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario? (2) Is this problem only seen within Docker containers? (3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system. I know it's only happening when you run on CUDA, but we'd still like to understand the environment.
@austinpagan No need to apologize! Let me try to be more precise and do my best to answer your questions.
(2) Is this problem only seen within Docker containers?
No, this error can be reproduced w/ and w/o Docker. But I believe Docker is the easiest way to reproduce the error on your side as it ensures we are using the same environment.
(3) I cannot see in this documentation whether you found this problem on a Power system or on an X86 system.
We don't test Power systems, so we can only be 100% sure that X86 systems are affected.
(1) I could not find, in your link, any guidance that I could understand, can you point me more specifically to the recreate scenario?
0. Install Docker (and nvidia-docker2) on your machine. https://docs.docker.com/engine/install/ubuntu/ and https://github.com/NVIDIA/nvidia-docker#getting-started can help with this.
1. git clone --recursive https://github.com/microsoft/LightGBM
2. Set GITHUB_WORKSPACE to the path where you've cloned the LightGBM repository. It will be something like export GITHUB_WORKSPACE=/home/yourUserName/Documents/LightGBM.
3. export ROOT_DOCKER_FOLDER=/LightGBM
4. Create an env file for the Docker container:
cat > docker.env <<EOF
TASK=cuda
COMPILER=gcc
GITHUB_ACTIONS=true
OS_NAME=linux
BUILD_DIRECTORY=$ROOT_DOCKER_FOLDER
CONDA_ENV=test-env
PYTHON_VERSION=3.8
EOF
cat > docker-script.sh <<EOF
export CONDA=\$HOME/miniconda
export PATH=\$CONDA/bin:\$PATH
nvidia-smi
$ROOT_DOCKER_FOLDER/.ci/setup.sh || exit -1
$ROOT_DOCKER_FOLDER/.ci/test.sh
source activate \$CONDA_ENV
cd \$BUILD_DIRECTORY/examples/python-guide/
python simple_example.py
EOF
sudo docker run --env-file docker.env -v "$GITHUB_WORKSPACE":"$ROOT_DOCKER_FOLDER" --rm --gpus all nvidia/cuda:11.0-devel-ubuntu20.04 /bin/bash $ROOT_DOCKER_FOLDER/docker-script.sh
This will run simple_example.py inside NVIDIA Docker and let you reproduce the error.
Please feel free to ping me if something is still not clear to you or you face any errors while preparing the repro.
So, since we're not conveniently set up with X86 boxes here, I decided to at least try to see if I could reproduce the problem on a Power system (since, after all, we did this exercise largely to allow folks on Power to access the GPUs, and did not contemplate that X86 folks would experiment with moving from OpenCL to direct CUDA).
INSIDE my docker container on my power box, I just ran the sample and the output looked like this:
(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[1] valid_0's l2: 0.244076 valid_0's l1: 0.493018
Training until validation scores don't improve for 5 rounds
[2] valid_0's l2: 0.240297 valid_0's l1: 0.489056
[3] valid_0's l2: 0.235733 valid_0's l1: 0.484089
[4] valid_0's l2: 0.231352 valid_0's l1: 0.479088
[5] valid_0's l2: 0.228939 valid_0's l1: 0.476159
[6] valid_0's l2: 0.22593 valid_0's l1: 0.472664
[7] valid_0's l2: 0.222515 valid_0's l1: 0.468425
[8] valid_0's l2: 0.219569 valid_0's l1: 0.464594
[9] valid_0's l2: 0.2168 valid_0's l1: 0.460795
[10] valid_0's l2: 0.214371 valid_0's l1: 0.457276
[11] valid_0's l2: 0.211988 valid_0's l1: 0.453923
[12] valid_0's l2: 0.210264 valid_0's l1: 0.451235
[13] valid_0's l2: 0.208926 valid_0's l1: 0.448992
[14] valid_0's l2: 0.207403 valid_0's l1: 0.44634
[15] valid_0's l2: 0.20601 valid_0's l1: 0.444016
[16] valid_0's l2: 0.204447 valid_0's l1: 0.441362
[17] valid_0's l2: 0.202712 valid_0's l1: 0.43891
[18] valid_0's l2: 0.201066 valid_0's l1: 0.436192
[19] valid_0's l2: 0.1998 valid_0's l1: 0.433884
[20] valid_0's l2: 0.198063 valid_0's l1: 0.431129
Did not meet early stopping. Best iteration is:
[20] valid_0's l2: 0.198063 valid_0's l1: 0.431129
Saving model...
Starting predicting...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
The rmse of prediction is: 0.4450426449744025
(base) [root@58814263a195 python-guide]#
That message about "double precision calculations" is telling me we are using our code. Is this a good result, or is there an error here?
I also wanted to try a raw run on a lightgbm repository completely outside of the Docker universe, so on a different Power box, I cloned the repository and did the following commands:
cd LightGBM
mkdir build ; cd build
cmake ..
make -j4
That all seemed to work, so I went into the directory with the program and ran it. It gave me the following fundamental error:
[fossum@rain6p1 python-guide]$ pwd
/home/fossum/LightGBM/examples/python-guide
[fossum@rain6p1 python-guide]$ python3.8 simple_example.py
Traceback (most recent call last):
File "simple_example.py", line 2, in <module>
import lightgbm as lgb
ModuleNotFoundError: No module named 'lightgbm'
[fossum@rain6p1 python-guide]$
I naively went back to the LightGBM directory and tried "make install", but that was a non-starter.
Not being a python expert, I figured I'd stop here and report my status, so maybe you could give me some pointers...
@austinpagan Am I right that you got a successful run of the simple_example.py script by following my guide from https://github.com/microsoft/LightGBM/issues/3450#issuecomment-754327830, but without step #0?
That message about "double precision calculations" is telling me we are using our code.
What do you mean by "our code"? The CUDA implementation your team contributed to the LightGBM repository, or some internal code from a fork?
Easy answer first: "our code" means the CUDA implementation our team contributed to the LightGBM repository. These warnings are only printed out when you run the code requesting the "cuda" device (as opposed to the OpenCL "gpu" device).
Yes, I ran "simple_example.py" following your guide, but skipping both steps 0 and 1, because we already have some Power boxes with functional docker containers, which already contained relatively recent clones of LightGBM, so I just went into one of them, and executed the "simple_example.py" program.
So, again, if you could help us figure out how to get the not-inside-a-container version running, we can hope to see the error there, and I can work on it.
Failing that, my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out. I could imagine this becoming an iterative process, and after a few iterations, we can determine why it's not working in your environment.
Thanks for your prompt response!
which already contained relatively recent clones of LightGBM
Could you be more precise and tell me which commit your local LightGBM version was compiled from? You can check it by running
git rev-parse HEAD
inside your local clone of the repo. Before taking any further steps, we should agree on the version we will debug with, because by continuing with different versions of the source files we would make the whole debugging process pointless.
Fortunately for both of us, I'm a morning person. With the nine-hour time difference between Moscow and Austin, me being at my computer at 3 PM your time will improve our productivity. To the extent that you can work a bit into your evening, that helps as well!
(base) [root@58814263a195 LightGBM]# pwd
/home/builder/fossum/LightGBM
(base) [root@58814263a195 LightGBM]# git rev-parse HEAD
5d79ff20d1b7ae226531e2445b17d747b253a637
(base) [root@58814263a195 LightGBM]#
Now, if you want me to clone a fresh version of your choosing and try there, that will be fine, but you'll have to walk me through the process of building it to the point where my attempt to run the python test doesn't fail as I had indicated above on my other box. (My strengths are algorithms and debugging and c coding, not building and installing.)
I hope it's OK that we're more used to doing our work inside the docker container rather than issuing commands to the container from outside...
Now, if you want me to clone a fresh version of your choosing and try there, that will be fine,
No thanks, I believe that 5d79ff20d1b7ae226531e2445b17d747b253a637 is a good candidate for the debugging! Let's continue with this commit.
Given that simple code runs OK on POWER machine but fails on many x86 ones, it is starting to look like the bug affects only x86 architecture. However, it is quite strange because we are speaking about CUDA code executing on NVIDIA cards here...
I think we can follow your suggestion
my backup suggestion COULD be that I could provide you with a debug version of one source file from our LightGBM, and you could compile that into your favorite local branch of LightGBM, and see what interesting debug data it prints out.
Let me compile LightGBM with the commit we agreed on and collect the most verbose logs. Then I think you can suggest some debug code injections, and I'll recompile with them and get back with more info. I guess it will be the most efficient form of collaboration, given that we do not have easy access to POWER machines and you do not have easy access to x86 ones. Please let me know WDYT.
I am happy with this plan!
I have a recommendation. If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...
Also, I will just let you know that my plan would be to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file, but let's see what your log reports have to say.
Two more things.
(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?
(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...
OK, I have set up a fresh and minimal environment to start the debugging process.
If you can try to run your "most verbose" test INSIDE the container as I do, as opposed to running it as a command from outside the container, we can remove that variable as well. I have a dark suspicion that this may be a problem with Docker not doing a good job when GPUs are involved...
What variable do you mean? I run a bash script inside Docker. It's common practice to ask Docker to run something; it can't be a problem. More proof comes from other reports of the same error: I believe the users who reported them use quite different scripts and maybe do not use Docker at all. And they certainly do not use any of the variables that I use.
(1) can you teach me how to PROPERLY rebuild LightGBM and the examples so that I can be sure I'm not just running some old binary that HAPPENS to work?
Yeah, that's why I've asked you to set up a clean Docker environment. I was suspecting that you have some other version of LightGBM that works fine on your side, and now I'm quite confident about that. The thing is that the commit you told me your version of LightGBM is compiled from simply cannot be compiled. CMake reports the following error.
...
[ 77%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/data_parallel_tree_learner.cpp.o
/LightGBM/src/treelearner/cuda_tree_learner.cpp: In member function 'LightGBM::Tree* LightGBM::CUDATreeLearner::Train(const score_t*, const score_t*)':
/LightGBM/src/treelearner/cuda_tree_learner.cpp:538:59: error: no matching function for call to 'LightGBM::CUDATreeLearner::Train(const score_t*&, const score_t*&)'
538 | Tree *ret = SerialTreeLearner::Train(gradients, hessians);
| ^
In file included from /LightGBM/src/treelearner/cuda_tree_learner.h:25,
from /LightGBM/src/treelearner/cuda_tree_learner.cpp:6:
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note: candidate: 'virtual LightGBM::Tree* LightGBM::SerialTreeLearner::Train(const score_t*, const score_t*, bool)'
78 | Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) override;
| ^~~~~
/LightGBM/src/treelearner/serial_tree_learner.h:78:9: note: candidate expects 3 arguments, 2 provided
[ 80%] Building CXX object CMakeFiles/_lightgbm.dir/src/treelearner/feature_parallel_tree_learner.cpp.o
make[3]: *** [CMakeFiles/_lightgbm.dir/build.make:407: CMakeFiles/_lightgbm.dir/src/treelearner/cuda_tree_learner.cpp.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [CMakeFiles/Makefile2:304: CMakeFiles/_lightgbm.dir/all] Error 2
make[1]: *** [CMakeFiles/Makefile2:311: CMakeFiles/_lightgbm.dir/rule] Error 2
make: *** [Makefile:274: _lightgbm] Error 2
This happens due to the following recent changes in the LightGBM codebase: fcfd4132e6d40a22d52023396329c41fd3de4a42 (but those changes came before the commit we agreed on). So you should rebuild LightGBM to match the commit you've specified (and ensure that compilation fails), or tell me another (older) commit that your LightGBM version is really built from.
However, I went ahead and fixed the error that prevented the library from compiling.
These fixes allowed me to successfully compile the library with the commit you've mentioned (5d79ff20d1b7ae226531e2445b17d747b253a637).
Then I specified verbose=4 in simple_example.py to get debug logs from the C++ code, but unfortunately this didn't help: the error is still the same as before, with no additional info.
2021-01-07T15:06:02.5788235Z Loading data...
2021-01-07T15:06:02.5789446Z
2021-01-07T15:06:02.5789792Z Starting training...
2021-01-07T15:06:02.5790650Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5791552Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T15:06:02.5792427Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T15:06:02.5798769Z Traceback (most recent call last):
2021-01-07T15:06:02.5799483Z File "simple_example.py", line 38, in <module>
2021-01-07T15:06:02.5799965Z early_stopping_rounds=5)
2021-01-07T15:06:02.5801170Z File "/root/.local/lib/python3.6/site-packages/lightgbm/engine.py", line 228, in train
2021-01-07T15:06:02.5801839Z booster = Booster(params=params, train_set=train_set)
2021-01-07T15:06:02.5802709Z File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 2076, in __init__
2021-01-07T15:06:02.5803309Z ctypes.byref(self.handle)))
2021-01-07T15:06:02.5804122Z File "/root/.local/lib/python3.6/site-packages/lightgbm/basic.py", line 52, in _safe_call
2021-01-07T15:06:02.5805012Z raise LightGBMError(_LIB.LGBM_GetLastError().decode('utf-8'))
2021-01-07T15:06:02.5811139Z lightgbm.basic.LightGBMError: [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T15:06:02.5811782Z
2021-01-07T15:06:05.3524322Z ##[error]Process completed with exit code 1.
So I would really appreciate your suggestions for the plan you mentioned:
to put more instrumentation around ALL of the CUDA-related memory allocation commands in our code, and they all exist in a single C file
As for how to re-compile and reinstall LightGBM, it is quite simple.
Commands to compile the dynamic library: https://github.com/microsoft/LightGBM/blob/5eee55cc3e5a531e24530cbcd4f027a4b44ebcdd/.github/workflows/cuda.yml#L76-L80 Command to install python package with just compiled library: https://github.com/microsoft/LightGBM/blob/5eee55cc3e5a531e24530cbcd4f027a4b44ebcdd/.github/workflows/cuda.yml#L81
Here is the full script that is used to install and setup Docker, clone repository, install CMake, Python and so on: https://github.com/microsoft/LightGBM/blob/test_cuda/.github/workflows/cuda.yml
(2) just FYI, when I type "python --version" it reports: "Python 3.6.9 :: Anaconda, Inc." Don't know if this matters...
Thanks! I set up the same Python version (3.6) to mimic your environment.
Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...
Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:
(base) [root@58814263a195 treelearner]# sum cuda_tree_learner.cpp
36657 40
(base) [root@58814263a195 treelearner]#
Give me like 10 minutes, and I'll do a quick suggestion for some debug around that line 414 in src/treelearner/cuda_tree_learner.cpp...
Thank you very much!
Just for "synchronization" here's the sum check on my cuda_tree_learner.cpp, before I add debug to it:
Have you applied those two fixes?
However, I went ahead and fixed the error that prevented the library from compiling.
- fcdeb10
- 5eee55c
This may or may not end up being a "fix" if it helps, but it's useful information to have, and it's an easy change.
Please replace line 414 of src/treelearner/cuda_tree_learner.cpp with a different line, as follows:
Current line:
CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));
Suggested new line:
CUDASUCCESS_OR_FATAL(cudaMemcpy(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice));
Have you applied those two fixes?
Here is what I'm getting after the patch:
Check sum of cuda_tree_learner.cpp
15848 40
Sorry, I don't know how to "apply" a fix.
Oh, never mind. I see now. Give me a couple minutes.
Suggested new line:
CUDASUCCESS_OR_FATAL(cudaMemcpy(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice));
Done! Building...
Sorry, I don't know how to "apply" a fix.
Simply change lines #537-538
in file src/treelearner/cuda_tree_learner.cpp
from
Tree* CUDATreeLearner::Train(const score_t* gradients, const score_t *hessians) {
Tree *ret = SerialTreeLearner::Train(gradients, hessians);
to
Tree* CUDATreeLearner::Train(const score_t* gradients, const score_t *hessians, bool is_first_tree) {
Tree *ret = SerialTreeLearner::Train(gradients, hessians, is_first_tree);
and line #48
in file src/treelearner/cuda_tree_learner.h
from
Tree* Train(const score_t* gradients, const score_t *hessians);
to
Tree* Train(const score_t* gradients, const score_t *hessians, bool is_first_tree);
Done! Building...
Built! Unfortunately, no changes...
I'm afraid that I can't get more debug info due to this issue on our side: #3641.
And, on my end, I can't get this code to build. So frustrating...
so, when you say "unfortunately, no changes" you mean the error reported is exactly the same, even with the change I proposed? That would be good news, because it means the error is actually in THAT CALL, and not in some previous call in the same "async thread"...
still claiming the problem is in line 414, right?
And, on my end, I can't get this code to build. So frustrating...
"this code" = code with these fixes https://github.com/microsoft/LightGBM/issues/3450#issuecomment-756209798?
Maybe you don't have all source files? Could you please try to re-clone the repo and only after that apply a fix?
git clone --recursive https://github.com/microsoft/LightGBM.git
cd LightGBM
git checkout 5d79ff20d1b7ae226531e2445b17d747b253a637
<apply fixes to src/treelearner/cuda_tree_learner.h and src/treelearner/cuda_tree_learner.cpp>
so, when you say "unfortunately, no changes" you mean the error reported is exactly the same, even with the change I proposed?
Yes, absolutely right.
still claiming the problem is in line 414, right?
I guess so. At least the error comes from line 414...
so when I try to build, it's trying to get files from the "external_libs" directory, but in my clone, that directory just contains two empty sub-directories... any idea whether I'm missing some piece of the build that populates those directories? It looks like there's a "setup.py" file that mentions this directory, but I don't know who is supposed to execute that setup command...
We are investigating, but I figured it wouldn't hurt to ask you if you just know the answer off the top of your head...
that directory just contains two empty sub-directories...
Please make sure you don't forget the --recursive flag when cloning the repo.
git clone --recursive https://github.com/microsoft/LightGBM.git
I've tried it and can confirm that we can reproduce the error with a simple command-line program. I simplified the reproducible example so that it no longer requires a Python installation. I believe it will help to sync our environments.
Fortunately, the error is still the same. But we no longer need the Python layer as a proxy: now we run the simple regression example from the repository directly via the CLI version of LightGBM, whereas previously we ran it via our Python package.
Please take a look at the greatly simplified script (no Python, no env variables) that we run inside Docker to reproduce the error: https://github.com/microsoft/LightGBM/blob/bcc3f291c8470bd680aa0c332cfaa3b1a0d01bdd/.github/workflows/cuda.yml#L43-L62
This script:
- runs nvidia-smi
- changes device from cpu to cuda in the source config file (later we will see the "[Warning] CUDA currently requires double precision calculations." warning that proves the change was successful)
- compiles LightGBM, including cuda_tree_learner.cpp
- runs the lightgbm executable program

And here are more verbose logs from the run after applying your proposed change at line 414 of the src/treelearner/cuda_tree_learner.cpp file (https://github.com/microsoft/LightGBM/issues/3450#issuecomment-756205729):
2021-01-07T18:54:57.1318861Z [LightGBM] [Warning] CUDA currently requires double precision calculations.
2021-01-07T18:54:57.1320390Z [LightGBM] [Info] Finished loading parameters
2021-01-07T18:54:57.1320991Z [LightGBM] [Debug] Loading train file...
2021-01-07T18:54:57.1405940Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:57.1597220Z [LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
2021-01-07T18:54:58.2787014Z [LightGBM] [Debug] Loading validation file #1...
2021-01-07T18:54:58.2879002Z [LightGBM] [Info] Loading initial scores...
2021-01-07T18:54:58.2964932Z [LightGBM] [Info] Finished loading data in 1.165807 seconds
2021-01-07T18:54:58.2965532Z [LightGBM] [Info] LightGBM using CUDA trainer with DP float!!
2021-01-07T18:54:58.2971585Z [LightGBM] [Info] Total Bins 6132
2021-01-07T18:54:58.2981032Z [LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
2021-01-07T18:54:58.2981689Z [LightGBM] [Debug] device_bin_size_ = 256
2021-01-07T18:54:58.2982161Z [LightGBM] [Debug] Resized feature masks
2021-01-07T18:54:58.2982684Z [LightGBM] [Debug] Memset pinned_feature_masks_
2021-01-07T18:54:58.2983679Z [LightGBM] [Debug] Allocated device_features_ addr=0x7ff5aaa00000 sz=196000
2021-01-07T18:54:58.2985727Z [LightGBM] [Debug] Memset device_data_indices_
2021-01-07T18:54:58.2991002Z [LightGBM] [Fatal] [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.2995493Z [LightGBM] [Debug] created device_subhistograms_: 0x7ff5ab000000
2021-01-07T18:54:58.3027139Z
2021-01-07T18:54:58.3027684Z [LightGBM] [Debug] Started copying dense features from CPU to GPU
2021-01-07T18:54:58.3028247Z Met Exceptions:
2021-01-07T18:54:58.3028802Z [CUDA] invalid argument /LightGBM/src/treelearner/cuda_tree_learner.cpp 414
2021-01-07T18:54:58.3029237Z
2021-01-07T18:54:58.3030255Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 1
2021-01-07T18:54:58.3031103Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3031917Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3032773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3033581Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3034408Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3035216Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3036038Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3036843Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3037660Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3038459Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3039263Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3040077Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3041108Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3041993Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3042794Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3043607Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3044405Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3045225Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3046029Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3046847Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3047646Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3048447Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3049264Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3050082Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3050902Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3051702Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3052521Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3053318Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3054138Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3054939Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3055754Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3056550Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3057347Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3058351Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3059161Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3059976Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3060773Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:58.3061589Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 3
2021-01-07T18:54:58.3062382Z [LightGBM] [Debug] Started copying dense features from CPU to GPU - 2
2021-01-07T18:54:59.6885338Z ##[error]Process completed with exit code 255.
Hope they will help somehow. Please let me know how I can modify the source code of the CUDA tree learner further to get useful info that will help narrow down the problem.
So, sorry for the delay in response. My colleague seems to be close to figuring out how we can reproduce this problem on Power systems. You can rest easy for now, because if he is successful, we can handle it from here on out...
Oh, great news! Thank you very much!
And, it is confirmed. On my power system, I get this now:
(base) [root@58814263a195 python-guide]# python simple_example.py
Loading data...
Starting training...
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Warning] Using sparse features with CUDA is currently not supported.
[LightGBM] [Warning] CUDA currently requires double precision calculations.
[LightGBM] [Fatal] [CUDA] invalid argument /home/builder/fossum/LightGBM/src/treelearner/cuda_tree_learner.cpp 414
Traceback (most recent call last):
File "simple_example.py", line 39, in
(base) [root@58814263a195 python-guide]#
So, again, I can pursue this now, without pestering you. Wish me luck!
Hah, in any other situation people shouldn't be happy when someone else gets errors from software, but right now I'm happy! 😄 Hope it won't be hard for you to find the root cause.
Again, if you are not comfortable using Python, please check my message above where I show how to reproduce the same error with LightGBM's executable binary from the command line interface. Feel free to ask for any details if something is not clear.
@StrikerRUS The problem is that the non-CUDA vector allocators were changed to use kAlignedSize with VirtualFileWriter::AlignedSize between 3.0 and 3.1, so the CUDA vector allocator wasn't allocating enough space in some instances. Here is a proposed change to fix the CUDA vector allocator. simple_example.py and advanced_example.py work with this change.
diff --git a/include/LightGBM/cuda/vector_cudahost.h b/include/LightGBM/cuda/vector_cudahost.h
index 03db338..46698d0 100644
--- a/include/LightGBM/cuda/vector_cudahost.h
+++ b/include/LightGBM/cuda/vector_cudahost.h
@@ -42,6 +42,7 @@ struct CHAllocator {
T* allocate(std::size_t n) {
T* ptr;
if (n == 0) return NULL;
+ n = (n + kAlignedSize - 1) & -kAlignedSize;
#ifdef USE_CUDA
if (LGBM_config_::current_device == lgbm_device_cuda) {
cudaError_t ret = cudaHostAlloc(&ptr, n*sizeof(T), cudaHostAllocPortable);
@austinpagan @ChipKerchner Awesome! I can confirm that this fix helps to get rid of the errors on X86 machines as well. Many thanks for the research you've done and for providing the fix!
Would you like to contribute this fix from your account so that GitHub will associate the fixing commit with you? Or, if it's not very important to you, would you prefer to leave this to one of the LightGBM maintainers to save your time?
Fixed via #3748.
@StrikerRUS This should fix the remaining CUDA failures. Let me know if you see any issues.
diff --git a/src/treelearner/cuda_tree_learner.cpp b/src/treelearner/cuda_tree_learner.cpp
index 16569ee..4495578 100644
--- a/src/treelearner/cuda_tree_learner.cpp
+++ b/src/treelearner/cuda_tree_learner.cpp
@@ -408,7 +408,7 @@ void CUDATreeLearner::copyDenseFeature() {
// looking for dword_features_ non-sparse feature-groups
if (!train_data_->IsMultiGroup(i)) {
dense_feature_group_map_.push_back(i);
- auto sizes_in_byte = train_data_->FeatureGroupSizesInByte(i);
+ auto sizes_in_byte = std::min(train_data_->FeatureGroupSizesInByte(i), static_cast<size_t>(num_data_));
void* tmp_data = train_data_->FeatureGroupData(i);
Log::Debug("Started copying dense features from CPU to GPU - 2");
CUDASUCCESS_OR_FATAL(cudaMemcpyAsync(&device_features[copied_feature * num_data_], tmp_data, sizes_in_byte, cudaMemcpyHostToDevice, stream_[device_id]));
@@ -534,8 +534,8 @@ void CUDATreeLearner::InitGPU(int num_gpu) {
copyDenseFeature();
}
@ChipKerchner After applying this fix, all but two tests pass! Very nice indeed!
Failures in the two remaining plotting tests are not related to the CUDA implementation. I believe it is the same graphviz environment issue as in https://github.com/microsoft/LightGBM/pull/3672#issuecomment-757642931.
============================= test session starts ==============================
platform linux -- Python 3.8.2, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
rootdir: /LightGBM
collected 238 items
../tests/c_api_test/test_.py .. [ 0%]
../tests/python_package_test/test_basic.py ............. [ 6%]
../tests/python_package_test/test_consistency.py ...... [ 8%]
../tests/python_package_test/test_dask.py ............................ [ 20%]
../tests/python_package_test/test_dual.py s [ 21%]
../tests/python_package_test/test_engine.py ............................ [ 32%]
....................................... [ 49%]
../tests/python_package_test/test_plotting.py F...F [ 51%]
../tests/python_package_test/test_sklearn.py ........................... [ 62%]
......x.........................................x....................... [ 92%]
................. [100%]
= 2 failed, 233 passed, 1 skipped, 2 xfailed, 74 warnings in 195.32s (0:03:15) =
Failures in two remaining plotting tests are not related to CUDA implementation. I believe it is the same graphviz environment issue as in #3672 (comment).
../tests/python_package_test/test_plotting.py F...F [ 51%]
In my branch, test_plotting passes all tests.
python -m unittest tests/python_package_test/test_plotting.py
.../test_plotting.py:156: UserWarning: More than one metric available, picking one to plot.
ax0 = lgb.plot_metric(evals_result0)
..s
----------------------------------------------------------------------
Ran 5 tests in 1.956s
OK (skipped=2)
Woo! Thanks @ChipKerchner. Like @StrikerRUS mentioned, I think it's very, very unlikely that the two failing plotting tests are related to your changes. I found in https://github.com/microsoft/LightGBM/pull/3672#issuecomment-757642931 that there might be some issues with the conda-forge recipe for graphviz.
Yeah, thanks for the info about the tests @ChipKerchner! I'm 100% sure that the 2 failing plotting tests on our side are related to our environment, and I'll fix this environment issue while working on making CUDA builds run on a regular basis.
@StrikerRUS: Look at you, making all our dreams come true!!! Thank you!
@austinpagan Thanks a lot for all your hard work!
I'm trying to use multiple GPUs to train the model. When I increase the amount of data, this issue happens.
Everything goes well if the size of the train set is less than 10000.
Operating System: Linux
CPU/GPU model: GPU