rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Encountering raft::cuda_error with cuML's RandomForestClassifier on GPU. (cp.cuda.Device(2).use()) #5983

Open m946107011 opened 3 months ago

m946107011 commented 3 months ago

Hi,

I am encountering an issue when selecting a GPU using cp.cuda.Device(2).use(). When I do not specify the GPU device, the script runs without errors.

Description:

I am using RAPIDS 24.06, CUDA 12.4, and Python 3.9. I encounter a raft::cuda_error when using cuml's RandomForestClassifier with GPU devices.

Code:

```python
import cupy as cp
import os
import cudf
import cuml
import pandas as pd
from sklearn import model_selection
from cuml import datasets
import dask
from dask.distributed import Client, wait
from dask_cuda import LocalCUDACluster
from dask.utils import parse_bytes
from numba import cuda
import dask_cudf
from cuml.ensemble import RandomForestClassifier as cuRFC
from cuml import ForestInference
import joblib
from tqdm import tqdm
from scipy import stats
from sklearn import metrics
import pickle
from collections import Counter
import random
import shutil
import time
import gc
import warnings
import numpy as np
import multiprocessing

cp.cuda.Device(2).use()
model_parameter = cuRFC(n_estimators=500, max_features='log2', random_state=seed)
```

Error Message:

```
CURFC
/home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/cuml/internals/api_decorators.py:344: UserWarning: For reproducible results in Random Forest Classifier or for almost reproducible results in Random Forest Regressor, n_streams=1 is recommended. If n_streams is > 1, results may vary due to stream/thread timing differences, even when random_state is set
  return func(**kwargs)
terminate called after throwing an instance of 'raft::cuda_error'
  what():  CUDA error encountered at: file=/opt/conda/conda-bld/work/cpp/src/decisiontree/batched-levelalgo/builder.cuh line=331: call='cudaMemsetAsync(done_count, 0, sizeof(int) * max_batch * n_col_blks, builder_stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 7 stack frames
1 in /home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/cuml/internals/../../../../libcuml++.so: raft::cuda_error::cuda_error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) +0x5a [0x767aa52af28a]
2 in /home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/cuml/internals/../../../../libcuml++.so: ML::DT::Builder<ML::DT::GiniObjectiveFunction<float, int, int> >::assignWorkspace(char*, char*) +0x308 [0x767aa5dc13e8]
3 in /home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/cuml/internals/../../../../libcuml++.so: ML::DT::Builder<ML::DT::GiniObjectiveFunction<float, int, int> >::Builder(raft::handle_t const&, CUstream_st*, int, unsigned long, ML::DT::DecisionTreeParams const&, float const*, int const*, int, int, rmm::device_uvector<int>*, int, ML::DT::Quantiles<float, int> const&) +0x2fc [0x767aa5dc19cc]
4 in /home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/cuml/internals/../../../../libcuml++.so(+0xdf021f) [0x767aa5df021f]
5 in /home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/site-packages/sklearn/utils/../../../../libgomp.so.1(+0x18f09) [0x767abecbbf09]
6 in /lib/x86_64-linux-gnu/libc.so.6(+0x94ac3) [0x767ed8294ac3]
7 in /lib/x86_64-linux-gnu/libc.so.6(+0x126850) [0x767ed8326850]
/home1/rhlin/anaconda3/envs/rapids-24.06/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 24 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
Aborted (core dumped)
```

Any suggestions for resolving this issue?

Thank you so much.

RH

dantegd commented 3 months ago

Thanks for the issue @m946107011. Interestingly enough, I have not used CuPy's CUDA device selection mechanism with RAPIDS in general, and it is untested with cuML in particular. I would recommend using the environment variable CUDA_VISIBLE_DEVICES instead. Are you planning to use multi-GPU capabilities? Asking since I saw the multiple Dask imports in the code you shared.
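
For reference, a minimal sketch of the environment-variable approach. The device index `2` mirrors the snippet in the question, and the commented-out cuML lines are placeholders for the original script; whether this resolves the crash is an assumption, not a verified fix:

```python
import os

# Restrict this process to physical GPU 2 *before* any CUDA-aware library
# (CuPy, cuDF, cuML, Numba) is imported, so they all agree on the device.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"

# CUDA imports must come after the variable is set, e.g.:
# import cupy as cp
# from cuml.ensemble import RandomForestClassifier as cuRFC

# Inside this process the one visible GPU is enumerated as device 0,
# so no cp.cuda.Device(2).use() call is needed:
# model_parameter = cuRFC(n_estimators=500, max_features='log2',
#                         random_state=seed)

print(os.environ["CUDA_VISIBLE_DEVICES"])
```

Alternatively, the variable can be set when launching the script, e.g. `CUDA_VISIBLE_DEVICES=2 python script.py`, which avoids modifying the code at all.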