Closed szilard closed 3 years ago
Thanks very much for the detailed report and great reproducible example! I'll take a look at this later today.
Thank you so much @jameslamb
Actually, the old way of installing the R package works:
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/" R-package/src/install.libs.R && Rscript build_r.R
Oh, I was wrong, the old method above compiles indeed, but it actually does not work:
Error in lgb.call(fun_name = "LGBM_BoosterCreate_R", ret = handle, train_set_handle, :
[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
@szilard Hey!
Let me put my two cents in 🙂. I'm completely out of ideas how the sed-based installation is working while the argument-based one is not. Hope @jameslamb will find the root cause.
But I took a look at your Dockerfile. It seems that you are inheriting from the nvidia/cuda:11.0-devel-ubuntu20.04 image. The devel part in its name makes me sure that the image already has the OpenCL libraries installed, so you don't need to install the separate packages ocl-icd-opencl-dev and opencl-headers via apt. You just need to make env variables point at the appropriate paths. Please take a look at this section of our Docker example.
I can already see the following step in your Dockerfile 👍 https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L59-L61
That alone might not be enough, though. In that case you can add paths similar to those in the example to the CMake commands: the one from https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L87 here https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/R-package/src/install.libs.R#L166-L168 and the one from https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L88 here https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/R-package/src/install.libs.R#L135
I suspect that CMake goes mad somehow due to two OpenCL installations. Look at these lines from your log:
-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")
"found" and "Could NOT find" at the same time.
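One quick way to check the "two OpenCL installations" suspicion is to list every libOpenCL.so that is visible in the usual places. This is only a sketch: the helper name list_opencl_libs is made up for illustration, and the default directories are simply the two paths that show up in this thread.

```shell
# Hypothetical helper: print every libOpenCL.so* found directly inside the
# given directories. With no arguments it falls back to the two locations
# mentioned in this thread (an assumption, adjust for your image).
list_opencl_libs() {
    if [ "$#" -eq 0 ]; then
        set -- /usr/lib/x86_64-linux-gnu /usr/local/cuda/lib64
    fi
    for d in "$@"; do
        # Skip directories that do not exist in this image.
        [ -d "$d" ] || continue
        # -H follows a symlinked directory given on the command line
        # (the CUDA lib64 directory is often a symlink).
        find -H "$d" -maxdepth 1 -name 'libOpenCL.so*'
    done | sort
}
```

If list_opencl_libs prints entries from both directories, CMake has more than one candidate to choose from, and which one it picks depends on the search paths it is given.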
BTW, regarding "However, just compiling lightgbm (not the R package) seems fine:" and "You just need to make env variables to point at appropriate paths.":
Probably this is the reason... I mean, maybe the appropriate env variables are already set in the Docker command line, but R doesn't see them.
Well, as I corrected myself later, the sed version actually does not work properly anymore either (it compiles, but it does not actually add GPU support).
Yeah, my Dockerfile has a history of additions over the years (and hacks like the echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd thing). I'll see if I can clean it up with your suggestions @StrikerRUS.
However, lightgbm compiles fine outside the R package, so it seems it's only the R package that gets confused about OpenCL.
Trying to understand what's going on, trying to strip things down as much as possible:
If I remove the ocl-icd-opencl-dev opencl-headers clinfo install from my Dockerfile:
FROM nvidia/cuda:11.0-devel-ubuntu20.04
RUN apt-get update && \
DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
apt-get update && \
apt-get install -y r-base
RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake
ENV MAKE="make -j$(nproc)"
RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'
##RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo
RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev
RUN mkdir -p /etc/OpenCL/vendors && \
echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd ## otherwise lightgbm segfaults at runtime (compiles fine without it)
Then the R install errors out (obviously) with:
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && \
Rscript build_r.R --use-gpu
ERROR:
...
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - not found
-- Looking for CL_VERSION_1_1
-- Looking for CL_VERSION_1_1 - not found
-- Looking for CL_VERSION_1_0
-- Looking for CL_VERSION_1_0 - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find OpenCL (missing: OpenCL_LIBRARY OpenCL_INCLUDE_DIR)
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
CMakeLists.txt:138 (find_package)
...
The non-R install works even if stripped down to this (no need for any ENV variables or any of the other compiler flags mentioned above by @StrikerRUS):
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make
WORKS OK:
...
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - found
-- Found OpenCL: /usr/local/cuda/lib64/libOpenCL.so (found version "1.2")
-- OpenCL include directory: /usr/local/cuda/include
...
Removing either the -DOpenCL_LIBRARY or the -DOpenCL_INCLUDE_DIR flag breaks it:
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so .. && \
make
ERROR:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - not found
-- Looking for CL_VERSION_1_1
-- Looking for CL_VERSION_1_1 - not found
-- Looking for CL_VERSION_1_0
-- Looking for CL_VERSION_1_0 - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find OpenCL (missing: OpenCL_INCLUDE_DIR)
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
CMakeLists.txt:138 (find_package)
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
make
ERROR:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "1.2")
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
CMakeLists.txt:138 (find_package)
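Since both flags are needed in this stripped-down image, it can help to verify both paths before calling cmake. A minimal sketch, assuming a hypothetical helper named opencl_cmake_flags; it checks for CL/cl.h, which is the header CMake's FindOpenCL module looks for on Linux.

```shell
# Hypothetical helper: print the two -D flags only if both the library file
# and the CL/cl.h header exist; otherwise explain what is missing and fail.
opencl_cmake_flags() {
    lib="$1"
    inc="$2"
    if [ ! -f "$lib" ]; then
        echo "missing OpenCL library: $lib" >&2
        return 1
    fi
    if [ ! -f "$inc/CL/cl.h" ]; then
        echo "missing OpenCL headers under: $inc" >&2
        return 1
    fi
    printf -- '-DOpenCL_LIBRARY=%s -DOpenCL_INCLUDE_DIR=%s\n' "$lib" "$inc"
}
```

It could then be used as cmake -DUSE_GPU=1 $(opencl_cmake_flags /usr/local/cuda/lib64/libOpenCL.so /usr/local/cuda/include) .. (left unquoted on purpose so the two flags are split into separate arguments; this assumes the paths contain no spaces).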
If I add back the ocl-icd-opencl-dev opencl-headers clinfo install to my Dockerfile, then the R package fails (this was yesterday's result, just including it here to see the error message in context):
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && \
Rscript build_r.R --use-gpu
ERROR:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
CMakeLists.txt:138 (find_package)
but now the non-R install can be stripped of even more flags and it still works:
git clone --recursive https://github.com/microsoft/LightGBM && \
cd LightGBM && mkdir build && cd build && \
cmake -DUSE_GPU=1 .. && \
make
WORKS:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
Notice that it has now found OpenCL 2.2 instead of the old 1.2 included in nvidia/cuda:11.0-devel-ubuntu20.04 or the other NVIDIA Docker images like the one you guys are using.
Based on this it seems to me something's up with the R package (likely it can't get the OpenCL_LIBRARY value when it's calling cmake).
(And I think having ocl-icd-opencl-dev opencl-headers clinfo installed is preferable because it's a newer OpenCL version and it also takes care of the include and lib paths, at least in the non-R install.)
If I make this change in R-package/src/install.libs.R:
- cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+ cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so")
then it finds OpenCL:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
However, now it can't find Boost:
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
Could NOT find Boost (missing: filesystem system) (found suitable version
"1.71.0", minimum required is "1.56.0")
Call Stack (most recent call first):
/usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
/usr/share/cmake-3.16/Modules/FindBoost.cmake:2179 (find_package_handle_standard_args)
CMakeLists.txt:144 (find_package)
It seems like the R build script cannot find the necessary paths anymore somehow (for the GPU install), not only OpenCL.
Could you please try adding Boost paths (-DBOOST_INCLUDEDIR=) to cmake_args here as well?
cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so")
On our CI it finds Boost in /usr/include:
-- Found Boost: /usr/include (found suitable version "1.74.0", minimum required is "1.56.0") found components: filesystem system
I believe you can take the actual path for your Docker image from a successful command-line installation. Though, it's quite strange.
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
Regarding "And I think having ocl-icd-opencl-dev opencl-headers clinfo installed it's preferable because it's a newer OpenCL version and also takes care of the include and lib paths at least in the non-R install":
I'm not sure that a newer version from the Ubuntu ppa is better than the preinstalled native version from NVIDIA if you are really using NVIDIA cards for training.
@jameslamb I believe R-package needs the same additional command line options for GPU-version as our Python-package:
- boost-root
- boost-dir
- boost-include-dir
- boost-librarydir
- opencl-include-dir
- opencl-library
https://github.com/microsoft/LightGBM/tree/master/python-package#build-gpu-version
I can't make it get past Boost by adding -DBOOST_INCLUDEDIR=... to cmake_args.
All this is strange, because the last time I ran the benchmarks (September 2020) it was all working.
Please try -DBOOST_INCLUDEDIR=/usr/include/boost in cmake_args. And in case of failure again, add -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu as well.
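Before settling on a value for -DBOOST_LIBRARYDIR, it may be worth confirming that the directory really holds the two components CMake complains about (filesystem and system). A rough sketch, with a made-up helper name and the assumption that the libraries follow the usual libboost_<component>.* naming:

```shell
# Hypothetical helper: check that a directory contains the Boost components
# named in the CMake error above (filesystem and system).
boost_components_present() {
    libdir="$1"
    missing=0
    for comp in filesystem system; do
        found=0
        for f in "$libdir"/libboost_"$comp".*; do
            # An unmatched glob stays literal, so test that the file exists.
            if [ -e "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "missing Boost component: $comp" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, boost_components_present /usr/lib/x86_64-linux-gnu should succeed in an image where libboost-system-dev and libboost-filesystem-dev are installed.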
Indeed, with this old LightGBM commit 7e11d4aeabd4a39ff (Aug 30, 2020) and using the old sed method:
sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/" R-package/src/install.libs.R && Rscript build_r.R
it works:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/RtmpwQ9CUI/R.INSTALLa3474e269e7/lightgbm/src/build
Building lib_lightgbm
Scanning dependencies of target _lightgbm
[ 15%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/bin.cpp.o
[ 15%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_prediction.cpp.o
[ 18%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o
Well, actually I don't know. It compiles, though now I'm on an instance without a GPU, so I'm not sure if it adds GPU support (yesterday it seemed that the old sed thing compiles with the latest commit, but it does not add GPU support).
Sorry for the possible confusion, maybe I did not explain it the best way. What I mean is that
sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/" R-package/src/install.libs.R && Rscript build_r.R
seems to always compile OK, but in the latest lightgbm version it actually does not add GPU support.
In fact I think I'm able to see if GPU support was added even on this non-GPU instance, because with the old commit 7e11d4a (Aug 30, 2020), after compiling with the sed thing, I get:
Error in lgb.last_error() : api error: No OpenCL device found
Error in initialize(...) : lgb.Booster: cannot create Booster handle
while with the latest LightGBM commit (compiling with the sed thing) I get:
Error in lgb.call(fun_name = "LGBM_BoosterCreate_R", ret = handle, train_set_handle, :
[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1
So it seems the old sed method of installing the R package with GPU support still compiles, but at some point since September 2020 it stopped actually adding GPU support.
I'm not sure if finding out which commit broke this would help us fix the main issue (it's weird that the old sed thing still compiles even now).
With this:
- cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+ cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu")
(added boost libdir but not the include)
it compiles:
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/include (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
and it will probably run.
I get
Error in lgb.last_error() : api error: No OpenCL device found
Error in initialize(...) : lgb.Booster: cannot create Booster handle
but that is on a box without a GPU (a good sign). I'll have to try it out on an instance with a GPU.
The code I'm running btw:
suppressMessages({
library(data.table)
library(ROCR)
library(lightgbm)
library(Matrix)
})
set.seed(123)
d_train <- fread("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv", showProgress=FALSE)
d_test <- fread("https://s3.amazonaws.com/benchm-ml--main/test.csv", showProgress=FALSE)
d_all <- rbind(d_train, d_test)
d_all$dep_delayed_15min <- ifelse(d_all$dep_delayed_15min=="Y",1,0)
d_all_wrules <- lgb.convert_with_rules(d_all)
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules)
d_train <- d_all[1:nrow(d_train)]
d_test <- d_all[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test))]
p <- ncol(d_all)-1
dlgb_train <- lgb.Dataset(data = as.matrix(d_train[,1:p]), label = d_train$dep_delayed_15min, free_raw_data = FALSE)
cat(system.time({
md <- lgb.train(data = dlgb_train,
objective = "binary",
nrounds = 100, num_leaves = 512, learning_rate = 0.1,
categorical_feature = cols_cats,
device = "gpu",
verbose = 0)
})[[3]]," ",sep="")
phat <- predict(md, data = as.matrix(d_test[,1:p]))
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")
Error in lgb.last_error() : api error: No OpenCL device found
Nice, given that the error happens on a non-GPU machine! A good sign indeed!
But please note that a successfully compiled GPU version and device_type='gpu' in params may still result in training on the CPU. This can occur with CPUs that have onboard graphics and some combinations of the system-wide default platform and device (refer to gpu_platform_id and gpu_device_id). So to be 100% sure that LightGBM uses a real GPU, please take a look at the training log and find this line
[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
or use nvidia-smi ;)
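The log check can be automated by pulling the device name out of a saved training log. A small sketch, assuming the log was captured to a file, that its verbosity is high enough for Info lines to be printed, and that the line has exactly the format quoted above; the helper name is made up.

```shell
# Hypothetical helper: print the device name from a
# "[LightGBM] [Info] Using GPU Device: <name>, Vendor: <vendor>" log line.
# Prints nothing if the log has no such line.
gpu_device_from_log() {
    sed -n 's/^\[LightGBM\] \[Info\] Using GPU Device: \(.*\), Vendor: .*$/\1/p' "$1"
}
```

An empty result means the line was not found, which suggests either that training did not run on a GPU or that Info messages were suppressed.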
I ran it on an instance with GPU (p3 with V100):
With this patch:
- cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+ cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu")
that is, by using this hack in my Dockerfile (and with ocl-icd-opencl-dev opencl-headers clinfo added back):
RUN git clone --recursive https://github.com/microsoft/LightGBM && cd LightGBM && \
sed -i 's/cmake_args <- c(cmake_args, "-DUSE_GPU=ON")/cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=\/usr\/lib\/x86_64-linux-gnu\/libOpenCL.so -DBOOST_LIBRARYDIR=\/usr\/lib\/x86_64-linux-gnu")/' R-package/src/install.libs.R && \
Rscript build_r.R --use-gpu
it is compiling and running OK.
Full Dockerfile: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/Dockerfile
Full R code running: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/run/3-lightgbm.R
I wonder if on other systems it works out of the box or not (without adding the paths with the patch) as it used to run for me as well.
Those paths are default ones. Very strange that they are not propagated into R...
Thanks for such nice reproducible examples @szilard! I can look into this this weekend, and probably expose more options via the build_r.R command-line args, so you don't have to use sed.
Sounds great @jameslamb, thank you.
Thanks to both of you for all the great information, and a nice reproducible example!
I've proposed what I think could be a fix, in https://github.com/microsoft/LightGBM/pull/3779. It wouldn't "just work", but would at least allow you to pass in these paths as command-line args like you can in the Python package, so no one would need to use sed to re-write install.libs.R.
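To make the idea concrete, here is roughly what such an invocation might look like with the paths from the sed hack. This is a sketch, not the confirmed interface: --use-gpu and --boost-librarydir both appear in this thread, while --opencl-library is assumed to mirror the Python-package option name listed earlier. The helper only prints the command so that it can be inspected before running.

```shell
# Hypothetical helper: print (not run) a build_r.R command that passes the
# OpenCL library and Boost library directory as command-line args.
# NOTE: --opencl-library is an assumed flag name, see the text above.
build_r_gpu_cmd() {
    opencl_lib="$1"
    boost_libdir="$2"
    printf 'Rscript build_r.R --use-gpu --opencl-library=%s --boost-librarydir=%s\n' \
        "$opencl_lib" "$boost_libdir"
}
```

Calling build_r_gpu_cmd /usr/lib/x86_64-linux-gnu/libOpenCL.so /usr/lib/x86_64-linux-gnu prints a single command line that could replace the sed step in a Dockerfile.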
Thanks @jameslamb for the fix and for merging it into LightGBM master. I changed the Dockerfile in my repo GBM-perf to take advantage of this fix (replaced the sed hack with flags to the build script): https://github.com/szilard/GBM-perf/commit/3b56bf0b474edd5dcf8039c9ddd86cddb9c1d845 Thanks.
@szilard I'm afraid you have a typo (a duplicated = sign) in the commit you've linked:
--boost-librarydir==/usr/lib/x86_64-linux-gnu
------------------^--------------
Quite strange that compilation succeeded even with the typo.
Thanks @StrikerRUS, I fixed it now. Yeah, strange indeed that it was compiling with the == as well.
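Because a typo like this does not stop the build, a tiny lint step over the arguments can catch it early. A minimal sketch with a made-up helper name; it only flags long options that contain a doubled = sign and knows nothing about which options build_r.R accepts.

```shell
# Hypothetical lint: warn about long options written with "==", such as
# --boost-librarydir==/usr/lib/x86_64-linux-gnu. Returns 1 if any are found.
check_flag_syntax() {
    status=0
    for arg in "$@"; do
        case "$arg" in
            --*==*)
                echo "suspicious '==' in: $arg" >&2
                status=1
                ;;
        esac
    done
    return "$status"
}
```

In a Dockerfile this could run right before the build, for example check_flag_syntax --use-gpu --boost-librarydir=/usr/lib/x86_64-linux-gnu && Rscript build_r.R --use-gpu --boost-librarydir=/usr/lib/x86_64-linux-gnu.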
This used to work:
Now, I get this error:
If I build the docker image with the last RUN entry commented out:
with
and then run it:
then I can run things manually:
gives the same error.
However, just compiling lightgbm (not the R package) seems fine:
as here:
though I also see
but it compiles anyway:
So there must be something in the R package(?) cc @jameslamb