microsoft / LightGBM

A fast, distributed, high performance gradient boosting (GBT, GBDT, GBRT, GBM or MART) framework based on decision tree algorithms, used for ranking, classification and many other machine learning tasks.
https://lightgbm.readthedocs.io/en/latest/
MIT License

R package install with GPU support fails #3765

Closed szilard closed 3 years ago

szilard commented 3 years ago

This used to work:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
    apt-get update && \
    apt-get install -y r-base

RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake

ENV MAKE="make -j$(nproc)"

RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'

RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd   ## otherwise lightgbm segfaults at runtime (compiles fine without it)

RUN git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

Now, I get this error:

Cloning into 'LightGBM'...
Submodule 'include/boost/compute' (https://github.com/boostorg/compute) registered for path 'compute'
Submodule 'eigen' (https://gitlab.com/libeigen/eigen.git) registered for path 'eigen'
Submodule 'external_libs/fast_double_parser' (https://github.com/lemire/fast_double_parser.git) registered for path 'external_libs/fast_double_parser'
Submodule 'external_libs/fmt' (https://github.com/fmtlib/fmt.git) registered for path 'external_libs/fmt'
Cloning into '/LightGBM/compute'...
Cloning into '/LightGBM/eigen'...
Cloning into '/LightGBM/external_libs/fast_double_parser'...
Cloning into '/LightGBM/external_libs/fmt'...
Submodule path 'compute': checked out '36c89134d4013b2e5e45bc55656a18bd6141995a'
Submodule path 'eigen': checked out '8ba1b0f41a7950dc3e1d4ed75859e36c73311235'
Submodule path 'external_libs/fast_double_parser': checked out 'ace60646c02dc54c57f19d644e49a61e7e7758ec'
Submodule 'benchmark/dependencies/abseil-cpp' (https://github.com/abseil/abseil-cpp.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'
Submodule 'benchmark/dependencies/double-conversion' (https://github.com/google/double-conversion.git) registered for path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'
Cloning into '/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp'...
Cloning into '/LightGBM/external_libs/fast_double_parser/benchmarks/dependencies/double-conversion'...
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'cc09f1a6798c085c325569ef466bcdcffdc266d4'
* checking for file '/LightGBM/lightgbm_r/DESCRIPTION' ... OK
* preparing 'lightgbm':
* checking DESCRIPTION meta-information ... OK
* cleaning src
Warning in system2(command, args, stdout = NULL, stderr = NULL, ...) :
  error in running command
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
WARNING: directory 'lightgbm/src/compute/test' is empty
* looking to see if a 'data/datalist' file should be added
* building 'lightgbm_3.1.1.99.tar.gz'

* installing to library '/usr/local/lib/R/site-library'
* installing *source* package 'lightgbm' ...
** using staged installation
** libs
installing via 'install.libs.R' to /usr/local/lib/R/site-library/00LOCK-lightgbm/00new/lightgbm
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- R version passed into FindLibR.cmake: 4.0.3
-- Found LibR: /usr/lib/R
-- LIBR_EXECUTABLE: /usr/bin/R
-- LIBR_INCLUDE_DIRS: /usr/share/R/include
-- LIBR_CORE_LIBRARY: /usr/lib/R/lib/libR.so
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)

-- Configuring incomplete, errors occurred!
See also "/tmp/RtmpvcXiAX/R.INSTALL14755eba078/lightgbm/src/build/CMakeFiles/CMakeOutput.log".
Error in .run_shell_command("cmake", c(cmake_args, "..")) :
  Command failed with exit code: 1
* removing '/usr/local/lib/R/site-library/lightgbm'
Error in .run_shell_command(install_cmd, install_args) :
  Command failed with exit code: 1
Execution halted
The command '/bin/sh -c git clone --recursive https://github.com/microsoft/LightGBM &&     cd LightGBM &&     Rscript build_r.R --use-gpu' returned a non-zero code: 1

If I build the docker image with the last RUN entry commented out:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
    apt-get update && \
    apt-get install -y r-base

RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake

ENV MAKE="make -j$(nproc)"

RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'

RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd   ## otherwise lightgbm segfaults at runtime (compiles fine without it)

#RUN git clone --recursive https://github.com/microsoft/LightGBM && \
#    cd LightGBM && \
#    Rscript build_r.R --use-gpu

with

sudo docker build -t gbmperf_gpu .

and then run it:

sudo nvidia-docker run --rm -ti gbmperf_gpu /bin/bash

then I can run things manually:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

gives the same error.

However, just compiling lightgbm (not the R package) seems fine:

git clone --recursive https://github.com/microsoft/LightGBM  && \
cd LightGBM  &&  mkdir build  &&  cd build  &&  cmake -DUSE_GPU=1 ..  &&  make -j4

as here:

...
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/abseil-cpp': checked out 'd936052d32a5b7ca08b0199a6724724aea432309'
Submodule path 'external_libs/fast_double_parser/benchmarks/dependencies/double-conversion': checked out 'f4cb2384efa55dee0e6652f8674b05763441ab09'
Submodule path 'external_libs/fmt': checked out 'cc09f1a6798c085c325569ef466bcdcffdc266d4'
-- The C compiler identification is GNU 9.3.0
-- The CXX compiler identification is GNU 9.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /LightGBM/LightGBM/build
make[1]: warning: -j0 forced in submake: resetting jobserver mode.
Scanning dependencies of target lightgbm
Scanning dependencies of target _lightgbm
[  1%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/boosting.cpp.o
[  2%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt.cpp.o
[  4%] Building CXX object CMakeFiles/lightgbm.dir/src/boosting/gbdt.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm.dir/src/boosting/boosting.cpp.o
[  7%] Building CXX object CMakeFiles/lightgbm.dir/src/main.cpp.o
[  8%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o
[ 10%] Building CXX object CMakeFiles/lightgbm.dir/src/application/application.cpp.o
...

though I also see

/usr/include/CL/cl_version.h:34:104: note: #pragma message: cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)
   34 | #pragma message("cl_version.h: CL_TARGET_OPENCL_VERSION is not defined. Defaulting to 220 (OpenCL 2.2)")

but it compiles anyway:

[ 98%] Linking CXX shared library ../lib_lightgbm.so
[100%] Linking CXX executable ../lightgbm
[100%] Built target _lightgbm
[100%] Built target lightgbm

So there must be something in the R package(?) cc @jameslamb

jameslamb commented 3 years ago

Thanks very much for the detailed report and great reproducible example! I'll take a look at this later today.

szilard commented 3 years ago

Thank you so much @jameslamb

szilard commented 3 years ago

Actually, the old way of installing the R package works:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/"  R-package/src/install.libs.R && Rscript build_r.R

szilard commented 3 years ago

Oh, I was wrong, the old method above compiles indeed, but it actually does not work:

Error in lgb.call(fun_name = "LGBM_BoosterCreate_R", ret = handle, train_set_handle,  :
  [LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1

StrikerRUS commented 3 years ago

@szilard Hey!

Let me put my two cents in 🙂. I'm completely out of ideas as to how the sed-based installation works while the argument-based one does not. I hope @jameslamb will find the root cause.

But I took a look at your Dockerfile. It seems that you are inheriting from the nvidia/cuda:11.0-devel-ubuntu20.04 image. The devel part in its name makes me fairly sure that the image already has the OpenCL libraries installed, so you don't need to install the separate ocl-icd-opencl-dev and opencl-headers packages via apt. You just need to set env variables that point at the appropriate paths. Please take a look at this section of our Docker example:

https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L15-L20

I can see that your Dockerfile already performs the following action 👍 https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L59-L61

In addition, performing those actions might not be enough. In that case, you can add similar paths from the example to the CMake commands: the one from https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L87 goes here https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/R-package/src/install.libs.R#L166-L168, and the one from https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/docker/gpu/dockerfile.gpu#L88 goes here https://github.com/microsoft/LightGBM/blob/f997a0692ca0f26740d2bdef2695c3e881d4e918/R-package/src/install.libs.R#L135

I suspect that CMake goes mad somehow due to two OpenCL installations. Look at these lines from your log:

-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")

"found" and "Could NOT find" at the same time.

StrikerRUS commented 3 years ago

BTW,

> However, just compiling lightgbm (not the R package) seems fine:

> You just need to make env variables to point at appropriate paths.

Probably this is the reason... I mean, maybe appropriate env variables are already set in Docker command line, but R doesn't see them.
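One way to test this guess is to print the same variable from the shell and from inside R and compare the two. The sketch below is only an illustration: the helper name show_env_in_r is made up here, and PATH and LD_LIBRARY_PATH are example variables, not ones this thread confirms to be relevant.

```shell
#!/bin/sh
# Print one environment variable as seen by the shell and, if available, by Rscript.
# show_env_in_r is a hypothetical helper name used only for this sketch.
show_env_in_r() {
    var="$1"
    printf 'shell: %s=%s\n' "$var" "$(printenv "$var")"
    if command -v Rscript >/dev/null 2>&1; then
        # Sys.getenv() returns "" when the variable is not visible to R.
        Rscript -e "cat('R:     ${var}=', Sys.getenv('${var}'), '\n', sep = '')"
    else
        echo "Rscript not found; skipping the R-side check"
    fi
}

# Example variables only; substitute whichever paths your image relies on.
show_env_in_r PATH
show_env_in_r LD_LIBRARY_PATH
```

If the two lines differ for a variable, the environment is not reaching R as expected; if they match, the cause is probably elsewhere.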

szilard commented 3 years ago

Well, as I corrected myself later, the sed version actually does not work properly anymore either (it compiles, but it does not add GPU support actually).

Yeah, my Dockerfile has a history of additions over the years (and hacks like the echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd thing), I'll see if I can clean it up with your suggestions @StrikerRUS.

However, lightgbm compiles fine outside the R package, so it seems it's only the R package that gets confused about OpenCL.

szilard commented 3 years ago

Trying to understand what's going on, trying to strip down things as much as possible:

If I remove the ocl-icd-opencl-dev opencl-headers clinfo install from my docker:

FROM nvidia/cuda:11.0-devel-ubuntu20.04

RUN apt-get update && \
    DEBIAN_FRONTEND="noninteractive" apt-get install -y software-properties-common apt-transport-https

RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    add-apt-repository 'deb [arch=amd64] https://cran.rstudio.com/bin/linux/ubuntu focal-cran40/' && \
    apt-get update && \
    apt-get install -y r-base

RUN apt-get install -y git wget libcurl4-openssl-dev default-jdk-headless libssl-dev libxml2-dev cmake

ENV MAKE="make -j$(nproc)"

RUN R -e 'install.packages(c("R6","data.table","jsonlite"), repos = "https://cran.rstudio.com/")'

##RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev ocl-icd-opencl-dev opencl-headers clinfo
RUN apt-get install -y libboost-dev libboost-system-dev libboost-filesystem-dev

RUN mkdir -p /etc/OpenCL/vendors && \
    echo "libnvidia-opencl.so.1" > /etc/OpenCL/vendors/nvidia.icd   ## otherwise lightgbm segfaults at runtime (compiles fine without it)

Then the R install errors out (obviously) with:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

ERROR:
...
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - not found
-- Looking for CL_VERSION_1_1
-- Looking for CL_VERSION_1_1 - not found
-- Looking for CL_VERSION_1_0
-- Looking for CL_VERSION_1_0 - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY OpenCL_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)
...

The non-R install works even if stripped down to this (no need for any ENV variables or any of the other compiler flags mentioned above by @StrikerRUS):

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && mkdir build && cd build && \
    cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
    make 

WORKS OK:

...
-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - found
-- Found OpenCL: /usr/local/cuda/lib64/libOpenCL.so (found version "1.2")
-- OpenCL include directory: /usr/local/cuda/include
...

Removing either the -DOpenCL_LIBRARY or the -DOpenCL_INCLUDE_DIR flag breaks the build:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && mkdir build && cd build && \
    cmake -DUSE_GPU=1 -DOpenCL_LIBRARY=/usr/local/cuda/lib64/libOpenCL.so .. && \
    make     

ERROR:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - not found
-- Looking for CL_VERSION_1_1
-- Looking for CL_VERSION_1_1 - not found
-- Looking for CL_VERSION_1_0
-- Looking for CL_VERSION_1_0 - not found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_INCLUDE_DIR)
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && mkdir build && cd build && \
    cmake -DUSE_GPU=1 -DOpenCL_INCLUDE_DIR=/usr/local/cuda/include/ .. && \
    make 

ERROR:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - not found
-- Looking for CL_VERSION_2_1
-- Looking for CL_VERSION_2_1 - not found
-- Looking for CL_VERSION_2_0
-- Looking for CL_VERSION_2_0 - not found
-- Looking for CL_VERSION_1_2
-- Looking for CL_VERSION_1_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "1.2")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)
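The three cmake runs above suggest that, with only the CUDA image's OpenCL present, both flags have to be supplied together. A minimal sketch of a guard that refuses to emit the flags unless both the library and the CL/cl.h header exist follows; the function name opencl_cmake_flags is hypothetical, and the paths are the ones from the working command above, which may differ on other images.

```shell
#!/bin/sh
# opencl_cmake_flags LIBRARY INCLUDE_DIR
# Print both -D flags, or fail if either the library or CL/cl.h is missing.
# (Hypothetical helper for illustration; not part of LightGBM.)
opencl_cmake_flags() {
    lib="$1"
    inc="${2%/}"   # drop one trailing slash, if any
    if [ ! -e "$lib" ]; then
        echo "OpenCL library not found: $lib" >&2
        return 1
    fi
    if [ ! -e "$inc/CL/cl.h" ]; then
        echo "OpenCL header not found: $inc/CL/cl.h" >&2
        return 1
    fi
    printf '%s %s\n' "-DOpenCL_LIBRARY=$lib" "-DOpenCL_INCLUDE_DIR=$inc"
}

# Dry run: only print the cmake command that would be used.
if flags=$(opencl_cmake_flags /usr/local/cuda/lib64/libOpenCL.so /usr/local/cuda/include/); then
    echo "cmake -DUSE_GPU=1 $flags .."
else
    echo "not running cmake: OpenCL paths are incomplete"
fi
```

Failing early like this makes it obvious which of the two paths is missing, instead of leaving that to the CMake error message.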

If I add the ocl-icd-opencl-dev opencl-headers clinfo install back to my Dockerfile, then the R package fails (this was yesterday's result; I'm including it here to show the error message in context):

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && \
    Rscript build_r.R --use-gpu

ERROR:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find OpenCL (missing: OpenCL_LIBRARY) (found version "2.2")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindOpenCL.cmake:150 (find_package_handle_standard_args)
  CMakeLists.txt:138 (find_package)

but now the non-R install can be stripped down even more of flags and it still works:

git clone --recursive https://github.com/microsoft/LightGBM && \
    cd LightGBM && mkdir build && cd build && \
    cmake -DUSE_GPU=1 .. && \
    make 

WORKS:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include

Notice that it now finds OpenCL 2.2 instead of the older 1.2 included in nvidia/cuda:11.0-devel-ubuntu20.04 or the other NVIDIA Docker images, like the one you guys are using.

Based on this it seems to me something's up with the R package (likely it can't get the OpenCL_LIBRARY value when it's calling cmake).

(And I think having ocl-icd-opencl-dev opencl-headers clinfo installed is preferable because it's a newer OpenCL version and also takes care of the include and lib paths, at least in the non-R install.)
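With two OpenCL installations in one image, it can help to list every header and library that CMake could pick up. The sketch below searches the directories that appear in the logs in this thread; the function name list_opencl_candidates is made up for illustration, and the directory list is an assumption that may not cover other images.

```shell
#!/bin/sh
# list_opencl_candidates DIR...
# Print every cl.h header and libOpenCL.so* library found up to two levels
# below each directory. Missing directories are skipped silently.
# (Hypothetical helper for illustration.)
list_opencl_candidates() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # -H follows a symlink given as the starting directory.
        find -H "$dir" -maxdepth 2 \( -name 'cl.h' -o -name 'libOpenCL.so*' \) 2>/dev/null
    done
}

# Directories seen in the CMake output above.
list_opencl_candidates /usr/include /usr/lib/x86_64-linux-gnu \
    /usr/local/cuda/include /usr/local/cuda/lib64
```

If a header shows up under one prefix and the only library under another, that would be consistent with CMake reporting a version (read from the header) while still missing OpenCL_LIBRARY.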

szilard commented 3 years ago

If I make this change in R-package/src/install.libs.R:

-  cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+  cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so")

then it finds OpenCL:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include

However, then it can't find Boost now:

CMake Error at /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:146 (message):
  Could NOT find Boost (missing: filesystem system) (found suitable version
  "1.71.0", minimum required is "1.56.0")
Call Stack (most recent call first):
  /usr/share/cmake-3.16/Modules/FindPackageHandleStandardArgs.cmake:393 (_FPHSA_FAILURE_MESSAGE)
  /usr/share/cmake-3.16/Modules/FindBoost.cmake:2179 (find_package_handle_standard_args)
  CMakeLists.txt:144 (find_package)

It seems like the R build script cannot find the necessary paths anymore somehow (for the GPU install), not only OpenCL.

StrikerRUS commented 3 years ago

> It seems like the R build script cannot find the necessary paths anymore somehow (for the GPU install), not only OpenCL.

Could you please try adding Boost paths (-DBOOST_INCLUDEDIR=) to cmake_args here as well?

cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so")

On our CI it finds Boost in /usr/include:

-- Found Boost: /usr/include (found suitable version "1.74.0", minimum required is "1.56.0") found components: filesystem system 

I believe you can take the actual path for your Docker image from a successful command-line installation, though it's quite strange:

-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system

> And I think having ocl-icd-opencl-dev opencl-headers clinfo installed is preferable because it's a newer OpenCL version and also takes care of the include and lib paths, at least in the non-R install

I'm not sure that the newer version from the Ubuntu ppa is better than the preinstalled native version from NVIDIA if you are really using NVIDIA cards for training.

StrikerRUS commented 3 years ago

@jameslamb I believe the R-package needs the same additional command-line options for the GPU version as our Python-package:

- boost-root
- boost-dir
- boost-include-dir
- boost-librarydir
- opencl-include-dir
- opencl-library

https://github.com/microsoft/LightGBM/tree/master/python-package#build-gpu-version

szilard commented 3 years ago

I can't make it pass Boost by adding -DBOOST_INCLUDEDIR=... to cmake_args.

szilard commented 3 years ago

All of this is strange, because the last time I ran the benchmarks (September 2020) it was all working.

StrikerRUS commented 3 years ago

Please try -DBOOST_INCLUDEDIR=/usr/include/boost in cmake_args. And in case of failure again, add -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu as well.

szilard commented 3 years ago

Indeed, with this old LightGBM commit 7e11d4aeabd4a39ff (Aug 30, 2020) and using the old sed method:

sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/"  R-package/src/install.libs.R && Rscript build_r.R

it works:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/lib/x86_64-linux-gnu/cmake/Boost-1.71.0/BoostConfig.cmake (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system
-- Performing Test MM_PREFETCH
-- Performing Test MM_PREFETCH - Success
-- Using _mm_prefetch
-- Performing Test MM_MALLOC
-- Performing Test MM_MALLOC - Success
-- Using _mm_malloc
-- Configuring done
-- Generating done
-- Build files have been written to: /tmp/RtmpwQ9CUI/R.INSTALLa3474e269e7/lightgbm/src/build
Building lib_lightgbm
Scanning dependencies of target _lightgbm
[ 15%] Building CXX object CMakeFiles/_lightgbm.dir/src/io/bin.cpp.o
[ 15%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/gbdt_prediction.cpp.o
[ 18%] Building CXX object CMakeFiles/_lightgbm.dir/src/boosting/prediction_early_stop.cpp.o

szilard commented 3 years ago

Well, actually I don't know, it compiles, though now I'm on an instance without GPU, so I'm not sure if it adds GPU support (yesterday it seemed that the old sed thing compiles with the latest commit, but it does not add GPU support)

szilard commented 3 years ago

Sorry for possible confusion, maybe I did not explain it the best way, what I mean is that

sed -i "s/use_gpu <- FALSE/use_gpu <- TRUE/"  R-package/src/install.libs.R && Rscript build_r.R

seems to always compile OK, but in the latest lightgbm version it actually does not add GPU support.

In fact I think I'm able to see if GPU support was added even on this non-GPU instance, because with old commit 7e11d4a (Aug 30, 2020) after compiling with the sed thing, I get:

Error in lgb.last_error() : api error: No OpenCL device found
Error in initialize(...) : lgb.Booster: cannot create Booster handle

while with the latest LightGBM commit I get (compiling with the sed thing):

Error in lgb.call(fun_name = "LGBM_BoosterCreate_R", ret = handle, train_set_handle,  :
  [LightGBM] [Fatal] GPU Tree Learner was not enabled in this build.
Please recompile with CMake option -DUSE_GPU=1

So it seems the old sed method of installing the R package with GPU support still compiles, but at some point since September 2020 it stopped actually adding GPU support.

I'm not sure if it would help us fix the main issue if we find out which commit broke this (it's weird that the old sed thing still compiles even now).
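The two error messages above can serve as a rough test of whether GPU support was compiled in, even on a machine without a GPU. A minimal sketch that classifies a captured error message is below; the function name classify_gpu_error and the three labels it prints are invented for this illustration.

```shell
#!/bin/sh
# classify_gpu_error MESSAGE
# Map the two error messages quoted above to a short label.
# (Hypothetical helper; the labels are arbitrary.)
classify_gpu_error() {
    case "$1" in
        *"GPU Tree Learner was not enabled in this build"*)
            echo "not-compiled-with-gpu" ;;
        *"No OpenCL device found"*)
            echo "gpu-build-no-device" ;;
        *)
            echo "unknown" ;;
    esac
}

classify_gpu_error "Error in lgb.last_error() : api error: No OpenCL device found"
classify_gpu_error "[LightGBM] [Fatal] GPU Tree Learner was not enabled in this build."
```

Only the second label means the build itself lacks GPU support; the first just means the machine has no usable OpenCL device.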

szilard commented 3 years ago

With this:

-  cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+  cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu")

(added boost libdir but not the include)

it compiles:

-- Looking for CL_VERSION_2_2
-- Looking for CL_VERSION_2_2 - found
-- Found OpenCL: /usr/lib/x86_64-linux-gnu/libOpenCL.so (found version "2.2")
-- OpenCL include directory: /usr/include
-- Found Boost: /usr/include (found suitable version "1.71.0", minimum required is "1.56.0") found components: filesystem system

and it will probably run.

I get

Error in lgb.last_error() : api error: No OpenCL device found
Error in initialize(...) : lgb.Booster: cannot create Booster handle

but on a box without GPU (good sign), I'll have to try it out on an instance with GPU.

The code I'm running btw:

suppressMessages({
library(data.table)
library(ROCR)
library(lightgbm)
library(Matrix)
})

set.seed(123)

d_train <- fread("https://s3.amazonaws.com/benchm-ml--main/train-1m.csv", showProgress=FALSE)
d_test <- fread("https://s3.amazonaws.com/benchm-ml--main/test.csv", showProgress=FALSE)

d_all <- rbind(d_train, d_test)
d_all$dep_delayed_15min <- ifelse(d_all$dep_delayed_15min=="Y",1,0)

d_all_wrules <- lgb.convert_with_rules(d_all)       
d_all <- d_all_wrules$data
cols_cats <- names(d_all_wrules$rules) 

d_train <- d_all[1:nrow(d_train)]
d_test <- d_all[(nrow(d_train)+1):(nrow(d_train)+nrow(d_test))]

p <- ncol(d_all)-1
dlgb_train <- lgb.Dataset(data = as.matrix(d_train[,1:p]), label = d_train$dep_delayed_15min, free_raw_data = FALSE)

cat(system.time({
  md <- lgb.train(data = dlgb_train, 
            objective = "binary", 
            nrounds = 100, num_leaves = 512, learning_rate = 0.1, 
            categorical_feature = cols_cats,
            device = "gpu",
            verbose = 0)
})[[3]]," ",sep="")

phat <- predict(md, data = as.matrix(d_test[,1:p]))
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
cat(performance(rocr_pred, "auc")@y.values[[1]],"\n")

StrikerRUS commented 3 years ago

> Error in lgb.last_error() : api error: No OpenCL device found

Nice, given that the error happens on non-GPU machine! Indeed good sign!

But please note that even a successfully compiled GPU version with device_type='gpu' in params may still result in training on CPU. This can occur with CPUs that have onboard graphics and with some combinations of the system-wide default platform and device (refer to gpu_platform_id and gpu_device_id). So to be 100% sure that LightGBM uses a real GPU, please take a look at the training log and find this line:

[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation
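If the training output is saved to a file, that check can be scripted. The sketch below assumes the log was captured to a file and that the verbosity is high enough for Info lines to appear (as far as I know, verbose = 0 as in the R script above would suppress them). The function name uses_gpu_device is hypothetical.

```shell
#!/bin/sh
# uses_gpu_device LOGFILE
# Succeed if the log contains LightGBM's "Using GPU Device" info line.
# (Hypothetical helper for illustration.)
uses_gpu_device() {
    grep -qF '[LightGBM] [Info] Using GPU Device:' "$1"
}

# Demo with a temporary log holding the line quoted above.
log=$(mktemp)
printf '%s\n' '[LightGBM] [Info] Using GPU Device: GeForce MX150, Vendor: NVIDIA Corporation' > "$log"
if uses_gpu_device "$log"; then
    echo "GPU device line found"
else
    echo "no GPU device line; training may not have used the GPU"
fi
rm -f "$log"
```

The device name after the colon is worth reading too, since an onboard graphics device would also produce this line.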

szilard commented 3 years ago

or use nvidia-smi ;)

szilard commented 3 years ago

I ran it on an instance with GPU (p3 with V100):

With this patch:

-  cmake_args <- c(cmake_args, "-DUSE_GPU=ON")
+  cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu")

that is by using this hack in my Dockerfile (and with ocl-icd-opencl-dev opencl-headers clinfo added back):

RUN git clone --recursive https://github.com/microsoft/LightGBM && cd LightGBM && \
    sed -i 's/cmake_args <- c(cmake_args, "-DUSE_GPU=ON")/cmake_args <- c(cmake_args, "-DUSE_GPU=ON -DOpenCL_LIBRARY=\/usr\/lib\/x86_64-linux-gnu\/libOpenCL.so -DBOOST_LIBRARYDIR=\/usr\/lib\/x86_64-linux-gnu")/'  R-package/src/install.libs.R && \
    Rscript build_r.R --use-gpu

it is compiling and running OK.
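A caveat with sed-based patches like this one: sed exits successfully even when the pattern matches nothing, so a patch can silently stop applying when the target file changes. Whether that is what happened to the older use_gpu one-liner is not confirmed in this thread. A minimal sketch of a guarded patch follows; sed_checked is a made-up helper name, and using | as the delimiter is just a way to avoid escaping every / in the paths.

```shell
#!/bin/sh
# sed_checked FIXED_STRING SED_SCRIPT FILE
# Apply SED_SCRIPT to FILE only if FIXED_STRING occurs in it; fail loudly otherwise.
# (Hypothetical helper for illustration.)
sed_checked() {
    needle="$1"
    script="$2"
    file="$3"
    if ! grep -qF -- "$needle" "$file"; then
        echo "sed_checked: '$needle' not found in $file; nothing patched" >&2
        return 1
    fi
    # Write to a temporary file and move it into place (avoids relying on sed -i).
    sed "$script" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Intended use from the LightGBM checkout (left commented out in this sketch):
# sed_checked 'cmake_args <- c(cmake_args, "-DUSE_GPU=ON")' \
#     's|"-DUSE_GPU=ON"|"-DUSE_GPU=ON -DOpenCL_LIBRARY=/usr/lib/x86_64-linux-gnu/libOpenCL.so -DBOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu"|' \
#     R-package/src/install.libs.R
```

With a guard like this, a Docker build would stop at the patch step instead of producing a package that compiles without the intended change.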

Full Dockerfile: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/Dockerfile

Full R code running: https://github.com/szilard/GBM-perf/blob/f34c37357e82f7dd3d8f30e5625a7f268a3b98a5/gpu/run/3-lightgbm.R

I wonder if on other systems it works out of the box or not (without adding the paths with the patch) as it used to run for me as well.

StrikerRUS commented 3 years ago

> I wonder if on other systems it works out of the box or not (without adding the paths with the patch) as it used to run for me as well.

Those paths are default ones. Very strange that they are not propagated into R...

jameslamb commented 3 years ago

Thanks for such nice reproducible examples @szilard ! I can look into this this weekend, and probably expose more options via the build_r.R command-line args, so you don't have to use sed.

szilard commented 3 years ago

Sounds great @jameslamb, thank you.

jameslamb commented 3 years ago

Thanks to both of you for all the great information, and a nice reproducible example!

I've proposed what I think could be a fix, in https://github.com/microsoft/LightGBM/pull/3779. It wouldn't "just work", but would at least allow you to pass in these paths as command-line args like you can in the Python package, so no one would need to use sed to re-write install.libs.R.
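For illustration, a call that passes the paths as flags might look like the sketch below. Only --use-gpu and --boost-librarydir appear as build_r.R flags in this thread; --opencl-library is taken from the Python-package option list above and is assumed here to be accepted by build_r.R as well. The default paths are the ones from the Dockerfile patch and may differ on other systems. The script only prints the command instead of running it.

```shell
#!/bin/sh
# Defaults taken from the paths used earlier in this thread; override via the environment.
OPENCL_LIBRARY="${OPENCL_LIBRARY:-/usr/lib/x86_64-linux-gnu/libOpenCL.so}"
BOOST_LIBRARYDIR="${BOOST_LIBRARYDIR:-/usr/lib/x86_64-linux-gnu}"

# Assumption: build_r.R accepts --opencl-library in addition to --boost-librarydir.
build_cmd="Rscript build_r.R --use-gpu --opencl-library=${OPENCL_LIBRARY} --boost-librarydir=${BOOST_LIBRARYDIR}"

# Dry run: print the command instead of executing it.
echo "$build_cmd"
```

Keeping the paths in variables makes it easy to point the same Dockerfile at the CUDA image's OpenCL instead of the Ubuntu packages.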

szilard commented 3 years ago

Thanks @jameslamb for the fix and for merging it into LightGBM master. I changed the Dockerfile in my repo GBM-perf to take advantage of this fix (replaced the sed hack with flags to the build script): https://github.com/szilard/GBM-perf/commit/3b56bf0b474edd5dcf8039c9ddd86cddb9c1d845 Thanks.

StrikerRUS commented 3 years ago

@szilard I'm afraid you have a typo (duplicated = sign) in the commit you've linked:

--boost-librarydir==/usr/lib/x86_64-linux-gnu
------------------^--------------

Quite strange that compilation succeeds even with the typo.

szilard commented 3 years ago

Thanks @StrikerRUS , I fixed it now. Yeah, strange indeed it was compiling with the == as well.
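Since the build did not complain about the doubled = sign, a small check on the arguments before calling the build script could catch this kind of typo. This is only a sketch; find_double_equals is a made-up name and the check is a plain string match on long options.

```shell
#!/bin/sh
# find_double_equals ARG...
# Report every long option that contains "==" and return non-zero if any is found.
# (Hypothetical helper for illustration.)
find_double_equals() {
    status=0
    for arg in "$@"; do
        case "$arg" in
            --*==*)
                echo "suspicious flag: $arg" >&2
                status=1
                ;;
        esac
    done
    return "$status"
}

if find_double_equals --use-gpu --boost-librarydir=/usr/lib/x86_64-linux-gnu; then
    echo "flags look fine"
fi
```

Passing the typo from the linked commit (--boost-librarydir==/usr/lib/x86_64-linux-gnu) would make the function print a warning and return a non-zero status.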
