rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] How to enable CPU-only inference for FIL #5984

Closed Kaiyang-Chen closed 1 month ago

Kaiyang-Chen commented 1 month ago

What is your question?

I want to use FIL for a CPU-only inference task. The reason I am not using the GPU is that my streaming batch size is relatively small, so the data transfer time between CPU and GPU is unacceptable for my latency-critical application.

The first option I tried was to directly create a conda environment with rapidsai-nightly 24.08. But it seems libcuml++ is not compiled with the CUML_ENABLE_GPU=OFF flag, and it raises an error if I set CUDA_VISIBLE_DEVICES=''. Code:

from cuml.experimental import ForestInference
import numpy as np
from cuml.common.device_selection import set_global_device_type, get_global_device_type, using_device_type

set_global_device_type('cpu')
print('new device type:', get_global_device_type())

X = np.random.rand(1, 210)
with using_device_type('cpu'):
    fm = ForestInference.load("./dat/model_20240725.txt")
    y_out = fm.predict(X)
    print(y_out)

Output:

Traceback (most recent call last):
  File "/xxx/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 254, in ensure_initialized
    self.cuInit(0)
  File "/xxx/rapidsai-24.08/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 327, in safe_cuda_api_call
    self._check_ctypes_error(fname, retcode)
  File "/xxx/rapidsai-24.08/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 395, in _check_ctypes_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [100] Call to cuInit results in CUDA_ERROR_NO_DEVICE

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 4, in <module>
  File "/xxx/rapidsai-24.08/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 292, in __getattr__
    self.ensure_initialized()
  File "/xxx/rapidsai-24.08/lib/python3.10/site-packages/numba/cuda/cudadrv/driver.py", line 258, in ensure_initialized
    raise CudaSupportError(f"Error at driver init: {description}")
numba.cuda.cudadrv.error.CudaSupportError: Error at driver init: Call to cuInit results in CUDA_ERROR_NO_DEVICE (100)

So I tried to build cuML from source with CUML_ENABLE_GPU=OFF, but this time it failed because the import path goes through CUDA-related code that is not compiled into this build (I guess?).

Traceback (most recent call last):
  File "/xxx/cbondsig/optrade/fil.py", line 4, in <module>
    from cuml.experimental import ForestInference
  File "/xxx/cuml/python/cuml/cuml/__init__.py", line 17, in <module>
    from cuml.internals.base import Base, UniversalBase
  File "/xxx/cuml/python/cuml/cuml/internals/__init__.py", line 17, in <module>
    from cuml.internals.available_devices import is_cuda_available
  File "/xxx/cuml/python/cuml/cuml/internals/available_devices.py", line 16, in <module>
    from cuml.internals.device_support import GPU_ENABLED
ModuleNotFoundError: No module named 'cuml.internals.device_support'

The Python script is just a convenient way to check whether the package is functional. I am actually using the C++ API for cuml::fil.

// `forest` and `handle` are assumed to be declared at an enclosing scope
// (e.g., as class members), so the same handle is reused for every Predict() call:
//   std::unique_ptr<ML::experimental::fil::forest_model> forest;
//   raft_proto::handle_t handle;

void Load_Model(const std::string& filename) {
    // Load the LightGBM model using Treelite
    auto tl_model = treelite::model_loader::LoadLightGBMModel(filename);

    // Import the Treelite model into FIL
    forest = std::make_unique<ML::experimental::fil::forest_model>(
        ML::experimental::fil::import_from_treelite_model(
            *tl_model,                                        // The Treelite model
            ML::experimental::fil::tree_layout::depth_first,  // Tree layout
            128u,                                             // Align bytes
            true,                                             // Use double precision
            raft_proto::device_type::cpu                      // Memory type (CPU)
        )
    );
}

void Predict(std::vector<double>& avec, double* output) {
    // Perform prediction for one row, with input and output both in CPU memory
    forest->predict(handle, output, avec.data(), 1,
                    raft_proto::device_type::cpu, raft_proto::device_type::cpu);
}

And I execute my project with LD_PRELOAD. Since I can only build libcuml++.so from source, the libcuml.so and libcumlprims_mg.so come from the original rapidsai conda environment, and I run into the following problem:

symbol lookup error: /xxx/rapidsai-24.08/lib/libcuml.so: undefined symbol: _ZN2ML3GLM5qnFitIfiEEvRKN4raft8handle_tERKNS0_9qn_paramsEPT_bSA_T0_SB_SB_SA_SA_PiSA_S9_

So my questions are:

  1. Is there a convenient way to install a CPU-only build of cuML?
  2. If not, after I build the CPU-only cuML, how should I integrate it back into the RAPIDS project (link them together)?
  3. For my C++ implementation, is it enough to specify CPU execution by passing raft_proto::device_type::cpu in each function call, or is something else needed, such as a config option or environment variable?
  4. By the way, I saw the CPU inference code is integrated from Treelite; is there any further performance improvement right now? Is a performance boost for FIL CPU inference on the team's roadmap?

Thanks for your help, Kaiyang

hcho3 commented 1 month ago

Have you considered using FIL with Triton server? It offers a convenient method to serve tree models with CPUs and GPUs, and you can easily switch between them.

https://github.com/triton-inference-server/fil_backend

Kaiyang-Chen commented 1 month ago

Have you considered using FIL with Triton server? It offers a convenient method to serve tree models with CPUs and GPUs, and you can easily switch between them.

https://github.com/triton-inference-server/fil_backend

Hi, thanks for your reply. I have looked into that. What I want is actually not an inference backend; FIL inference is only a small part of my workflow, and I don't want to add the extra socket handling of an inference backend like Triton. I don't see a clear C API for fil_backend, or do you mean I should go through the code and leverage the API there?

By the way, from the CPU inference performance perspective, is FIL much better than Treelite (tl2cgen), or do they behave similarly right now?

hcho3 commented 1 month ago

Is there a convenient way to install a CPU-only build of cuML? If not, after I build the CPU-only cuML, how should I integrate it back into the RAPIDS project (link them together)?

Currently, the CUML_ENABLE_GPU option is recognized only by FIL. Even with CUML_ENABLE_GPU=OFF specified, the build script will still attempt to build *.cu files, because the flag is not recognized by the other submodules in cuML. So for best results, you should also add -DCUML_ALGORITHMS=FIL to exclude every submodule other than FIL.

Build FIL, CPU only:

cmake -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_CUML_C_LIBRARY=ON -DCUML_ALGORITHMS=FIL -DBUILD_CUML_TESTS=OFF \
    -DBUILD_CUML_MPI_COMMS=OFF -DBUILD_CUML_MG_TESTS=OFF  \
    -DCUML_USE_TREELITE_STATIC=OFF -DNVTX=OFF -DUSE_CCACHE=OFF \
    -DDISABLE_DEPRECATION_WARNINGS=ON -DCMAKE_PREFIX_PATH=$CONDA_PREFIX   \
    -DCUML_ENABLE_GPU=OFF ..
make

(assuming that you have a Conda environment set up according to BUILD.md)

Build FIL, with GPU code:

cmake -DCMAKE_INSTALL_PREFIX=$CONDA_PREFIX -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_CUML_C_LIBRARY=ON -DCUML_ALGORITHMS=FIL -DBUILD_CUML_TESTS=OFF \
    -DBUILD_CUML_MPI_COMMS=OFF -DBUILD_CUML_MG_TESTS=OFF  \
    -DCUML_USE_TREELITE_STATIC=OFF -DNVTX=OFF -DUSE_CCACHE=OFF \
    -DDISABLE_DEPRECATION_WARNINGS=ON -DCMAKE_PREFIX_PATH=$CONDA_PREFIX   \
    -DCUML_ENABLE_GPU=ON ..
make

For my C++ implementation, is it enough to specify CPU execution by passing raft_proto::device_type::cpu in each function call, or is something else needed, such as a config option or environment variable?

Specifying raft_proto::device_type::cpu should be enough.
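
For reference, here is a minimal sketch of the end-to-end CPU-only call pattern. The header paths, the feature count, and the output-buffer size are assumptions; adjust them to your model and install.

#include <cuml/experimental/fil/forest_model.hpp>
#include <cuml/experimental/fil/treelite_importer.hpp>
#include <treelite/model_loader.h>

#include <vector>

int main() {
    // Load the LightGBM model with Treelite
    auto tl_model = treelite::model_loader::LoadLightGBMModel("model.txt");

    // Import on the CPU; keep the handle around for later predict() calls
    auto forest = ML::experimental::fil::import_from_treelite_model(
        *tl_model,
        ML::experimental::fil::tree_layout::depth_first,
        128u,                           // align bytes
        true,                           // use double precision
        raft_proto::device_type::cpu);  // build the forest in host memory
    auto handle = raft_proto::handle_t{};

    std::vector<double> row(210, 0.0);  // one row of 210 features (example size)
    std::vector<double> out(1, 0.0);    // sized to the model's output count

    // Input and output both live in CPU memory, so pass device_type::cpu twice
    forest.predict(handle, out.data(), row.data(), 1,
                   raft_proto::device_type::cpu, raft_proto::device_type::cpu);
    return 0;
}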

I saw the CPU inference code is integrated from Treelite; is there any further performance improvement right now? Is a performance boost for FIL CPU inference on the team's roadmap? From the CPU inference performance perspective, is FIL much better than Treelite (tl2cgen)?

Currently, FIL offers an "experimental" implementation of CPU inference. FIL used to be GPU-only, but we are in the process of switching to a unified implementation that can run on both CPU and GPU. I won't claim state-of-the-art CPU performance for the experimental FIL, since the goal is to enable users to seamlessly switch between CPUs and GPUs in their inference stack.

hcho3 commented 1 month ago

the libcuml.so and libcumlprims_mg.so come from the original rapidsai conda environment, and I run into the following problem:

I just tried building cuML with the -DCUML_ALGORITHMS=FIL -DCUML_ENABLE_GPU=OFF flags and got a minimal libcuml++.so that only depends on system libraries:

$ ldd libcuml++.so
        linux-vdso.so.1 (0x00007ffdf5fe2000)
        libgomp.so.1 => /home/phcho/mambaforge/envs/cuml_dev/lib/libgomp.so.1 (0x0000711c76529000)
        libstdc++.so.6 => /home/phcho/mambaforge/envs/cuml_dev/lib/libstdc++.so.6 (0x0000711c76346000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x0000711c76249000)
        libgcc_s.so.1 => /home/phcho/mambaforge/envs/cuml_dev/lib/libgcc_s.so.1 (0x0000711c7622a000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x0000711c75e00000)
        /lib64/ld-linux-x86-64.so.2 (0x0000711c765f4000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x0000711c76223000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x0000711c7