[BUG] CUDA error using GLOBAL_QUANTILE for split_algo (experimental RF backend) #3948

Open Oleg-dM opened 3 years ago

Oleg-dM commented 3 years ago

Describe the bug: Fitting a RandomForestClassifier on a synthetic dataset, as in the RAPIDS documentation example, raises a CUDA error when n_rows is set above exactly 4864.

It works fine using split_algo = 0 (HIST), but that is 3 times slower.

Simplest working example

import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRFC

n_rows = 4864 # FAILS ABOVE 4864 -> this looks very much like a bug

X = np.random.normal(size=(n_rows,100)).astype(np.float32)
y = np.asarray([0,1]*(n_rows//2), dtype=np.int32)

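cuml_model = cuRFC(max_features=35, split_algo=1)  # remaining arguments are cut off in the original post; split_algo=1 is taken from "Expected behavior" below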

%time cuml_model.fit(X,y)

cuml_predict = cuml_model.predict(X)
RuntimeError                              Traceback (most recent call last)
<timed exec> in <module>

~/anaconda3/envs/rapids-0.19/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
    407                                 target_val=target_val)
--> 409                 return func(*args, **kwargs)
    411         @ wraps(func)

cuml/ensemble/randomforestclassifier.pyx in cuml.ensemble.randomforestclassifier.RandomForestClassifier.fit()

RuntimeError: CUDA error encountered at: file=../src/decisiontree/quantile/quantile.cuh line=236: call='cub::DeviceRadixSort::SortKeys( (void *)d_temp_storage->data(), temp_storage_bytes, &data[col_offset], single_column_sorted->data(), n_rows, 0, 8 * sizeof(T), stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 64 stack frames
#0 in /home/oleg/anaconda3/envs/rapids-0.19/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x46) [0x7f45884d5076]
#1 in /home/oleg/anaconda3/envs/rapids-0.19/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x69) [0x7f45884d57d9]
#2 in /home/oleg/anaconda3/envs/rapids-0.19/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML12DecisionTree16computeQuantilesIfEEvPT_iPKS2_iiSt10shared_ptrIN4raft2mr6device9allocatorEEP11CUstream_st+0x778) [0x7f458889f0d8]

Steps/Code to reproduce bug: Run the example above with n_samples > 4864.

Expected behavior: The model fits using the new backend (split_algo = 1).
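To pin down the exact row count at which the fit starts failing, a bisection over n_rows is quicker than trying values by hand. The sketch below is only an illustration: `largest_passing_n_rows` and `fake_fit` are hypothetical names, and `fake_fit` is a stand-in predicate that mimics the threshold reported here so the code runs without a GPU. It assumes the failure is monotone (every value above the threshold fails).

```python
from typing import Callable


def largest_passing_n_rows(fits_ok: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the largest n in [lo, hi) for which fits_ok(n) is True.

    Assumes fits_ok(lo) is True, fits_ok(hi) is False, and that the
    outcome flips exactly once in between.
    """
    if not fits_ok(lo) or fits_ok(hi):
        raise ValueError("need fits_ok(lo) == True and fits_ok(hi) == False")
    # Invariant: lo passes, hi fails.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits_ok(mid):
            lo = mid
        else:
            hi = mid
    return lo


def fake_fit(n_rows: int) -> bool:
    """Stand-in for a real fit; mimics the threshold reported in this issue."""
    return n_rows <= 4864


threshold = largest_passing_n_rows(fake_fit, lo=2, hi=100_000)
print(threshold)
```

With cuML installed, the predicate would instead build X and y for the given n_rows, call fit inside a try/except RuntimeError, and return whether it succeeded. Running each trial in a fresh process may be safer, since a CUDA error can leave the current process in an unclear state.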

Environment details (please complete the following information):

Installation procedure:

  1. Fresh Ubuntu 20.04 install
  2. Blacklist nouveau drivers
  3. sudo sh cuda_11.2.0_460.27.04_linux.run
  4. bash Anaconda3-2021.05-Linux-x86_64.sh
  5. conda create -n rapids-0.19 -c rapidsai -c nvidia -c conda-forge rapids-blazing=0.19 python=3.8 cudatoolkit=11.2

Nanthini10 commented 3 years ago

Is it working on Colab with n_samples > 6846?

I'm unable to reproduce this with 21.06. Can you try the latest version of RAPIDS and see if the error still persists?

You can install it either from source or using Docker, as follows: docker pull rapidsai/rapidsai-core-dev-nightly:21.06-cuda11.2-devel-ubuntu18.04-py3.8

Oleg-dM commented 3 years ago

Thanks for the quick answer - tried the docker image and unfortunately still got the same issue (and the RF fitting is much slower than with 0.19)

any other suggestions? I'm kinda losing hope ..

hcho3 commented 3 years ago

@Oleg-dM It appears that the issue is specific to your desktop. I tried setting up RAPIDS fresh on an AWS EC2 virtual machine and the example ran successfully with n_samples=6486. Here is how I set it up.

  1. Create a new EC2 instance with type g4dn.2xlarge.
  2. Install CUDA 11.2 by following directions in https://developer.nvidia.com/cuda-11.2.2-download-archive.
  3. Install Miniconda from https://docs.conda.io/en/latest/miniconda.html
  4. Set up Conda environment with the RAPIDS package by running
    conda create -n rapids -c rapidsai-nightly -c nvidia -c conda-forge rapids=21.06 python=3.8 cudatoolkit=11.2

Oleg-dM commented 3 years ago

Issue identified: error occurs using the experimental backend but works well using split_algo = 0 (HIST) which relies on the default backend.

As a reminder: the error occurs in file quantile.cuh, line=236, when calling the CUB function DeviceRadixSort::SortKeys (see the original post for details).

@hcho3 @Nanthini10

Also, I tested the documentation example on 2 distinct machines, each with the same fresh, from-scratch install of Ubuntu 20, CUDA and conda, and the error persisted.

hcho3 commented 3 years ago

I just tried running the script using my workstation (Quadro RTX 8000, CUDA 11.0) and could not reproduce the error.

Maybe the error is specific to older generations of graphics cards?

Oleg-dM commented 3 years ago

Thank you @hcho3, do you know who could look into that specific issue? Should we put this into a backlog somehow?

Maybe @canonizer?

dumerrill commented 3 years ago

FWIW, usually a cudaErrorInvalidValue error when reported by a call to CUB (or Thrust) is just "coughing up" a latent CUDA Runtime errno left over from some previous operation (e.g., a bad cudaMemcpy()), and has nothing to do with the sort itself.
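The "latent error" behaviour can be illustrated with a toy model in pure Python. This is not the CUDA runtime: `ToyRuntime` and `library_sort` are hypothetical stand-ins, loosely modelled on how cudaPeekAtLastError() reads a pending error without clearing it while cudaGetLastError() reads and resets it. The point of the sketch is only that a library call which checks the pending error can end up reporting a failure caused by an earlier, unchecked call.

```python
SUCCESS = 0
INVALID_VALUE = 1  # stands in for cudaErrorInvalidValue


class ToyRuntime:
    """Pure-Python stand-in for a runtime with a 'last error' slot."""

    def __init__(self) -> None:
        self._last_error = SUCCESS

    def memcpy(self, nbytes: int) -> int:
        # A bad argument records an error and also returns it.
        if nbytes < 0:
            self._last_error = INVALID_VALUE
            return INVALID_VALUE
        return SUCCESS

    def peek_at_last_error(self) -> int:
        # Read the pending error without clearing it.
        return self._last_error

    def get_last_error(self) -> int:
        # Read the pending error and reset the slot.
        err, self._last_error = self._last_error, SUCCESS
        return err


def library_sort(rt: ToyRuntime, keys):
    """Sort correctly, then report whatever error is pending."""
    result = sorted(keys)
    err = rt.peek_at_last_error()
    if err != SUCCESS:
        raise RuntimeError(f"sort failed: error {err}")
    return result


rt = ToyRuntime()
rt.memcpy(-1)  # the caller ignores the returned error code

try:
    library_sort(rt, [3, 1, 2])
    blamed_sort = False
except RuntimeError:
    blamed_sort = True  # the sort is blamed for the earlier bad memcpy

rt.get_last_error()  # drain the stale error
sorted_ok = library_sort(rt, [3, 1, 2])
print(blamed_sort, sorted_ok)
```

In real CUDA code the equivalent diagnostic is to check (and thereby clear) the pending error immediately before the suspect call, so that any error reported afterwards can only come from that call.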

Oleg-dM commented 3 years ago

Thanks a lot Duane for jumping in. Are RAPIDS guys understaffed? They don't seem to bother about simplest stuffs, feels more and more like a marketing library to make you buy expensive nvidia GPUs..

Have a good week

hcho3 commented 3 years ago

@Oleg-dM We apologize for the delay. We will follow up on this issue as soon as we can.

Oleg-dM commented 3 years ago

Thank you Philip - do you have an idea of a timeline ? Days or weeks ?

vinaydes commented 3 years ago

Hi @Oleg-dM, I had access to an sm_61 device, so I tried to debug the issue. Here are my observations:

  1. The issue is specific to sm_61 devices. Both your GPUs are sm_61, which is why you see this issue.
  2. Workaround: The issue appears only when you install the pre-built libcuml from the conda channel. If you build from source, the issue goes away. Building from source is not super complicated either. All it takes is creating a conda environment and invoking ./build.sh. You can find more here: https://github.com/rapidsai/cuml/blob/branch-21.08/BUILD.md.
  3. I am currently not sure what is the reason for such a difference between pre-built vs built from source. A key difference between pre-built and built from source is regarding which cuda PTX objects are present in the libcuml. Pre-built has PTX for sm_60 which should work for sm_61, so it should not really matter. However when I built from source for sm_60 (just like pre-built binary) the issue started appearing again. More investigation needed to refine the root cause further.
  4. @dumerrill To eliminate stale errors I added CUDA_CHECK(cudaDeviceSynchronize()) just before line quantile.cuh#L75l. The error was still with the cub::DeviceRadixSort::SortKeys function. Inside the function, kernel cub::DeviceRadixSortDownsweepKernel seems to throw the error at launch.
  5. I could reproduce the error with C++ benchmarking code from cuML, which speeds up the process of debugging. However unlike Python example which fails every time, C++ one fails intermittently.
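The reasoning in observation 3 (PTX for sm_60 "should work" on sm_61) follows from the general CUDA compatibility rules: compiled machine code (SASS) for capability X.y runs on devices X.z with z >= y, and PTX for a given capability can be JIT-compiled for any device of equal or higher capability. Below is a minimal sketch of those two rules; the function names and the architecture lists in the example calls are made up for illustration and are not the actual contents of the conda package.

```python
from typing import Iterable, Tuple

CC = Tuple[int, int]  # compute capability as (major, minor), e.g. (6, 1) for sm_61


def sass_runs_on(device: CC, sass: CC) -> bool:
    """SASS is usable on the same major version with an equal or higher minor."""
    return device[0] == sass[0] and device[1] >= sass[1]


def ptx_runs_on(device: CC, ptx: CC) -> bool:
    """PTX can be JIT-compiled for any device of equal or higher capability."""
    return device >= ptx


def how_loaded(device: CC, sass_archs: Iterable[CC], ptx_archs: Iterable[CC]) -> str:
    """Say which kind of embedded code a device could use, if any."""
    if any(sass_runs_on(device, arch) for arch in sass_archs):
        return "sass"
    if any(ptx_runs_on(device, arch) for arch in ptx_archs):
        return "ptx-jit"
    return "unsupported"


# Hypothetical binaries, for illustration only.
print(how_loaded((6, 1), sass_archs=[(7, 0)], ptx_archs=[(6, 0)]))
print(how_loaded((6, 1), sass_archs=[(6, 1)], ptx_archs=[]))
print(how_loaded((5, 2), sass_archs=[(6, 0)], ptx_archs=[(6, 0)]))
```

Under these rules an sm_61 device should be able to use sm_60 code either way, which is why the observed difference between the pre-built and source-built library is surprising.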

In short: @Oleg-dM, or anyone else affected by this issue, can use the workaround described above while we continue to debug the issue further.

Oleg-dM commented 3 years ago

Amazing thank you Vinay - will give it a try asap

update: @vinaydes I keep running into the error "#error The version of CUB in your include path is not compatible with this release of Thrust. CUB is now included in the CUDA Toolkit, so you no longer need to use your own checkout of CUB. Define THRUST_IGNORE_CUB_VERSION_CHECK to ignore this." Do you know where to define this flag? Wasn't able to - tried the build.sh way and the manual one but no luck.

vinaydes commented 3 years ago

@Oleg-dM It is probably not a good idea to ignore this error. Just to confirm, are these the steps you followed for building the code:

git clone https://github.com/rapidsai/cuml.git
cd cuml
git checkout branch-21.06 # last stable
conda env create --name cuml-dev-11.2 --file conda/environments/cuml_dev_cuda11.2.yml
conda activate cuml-dev-11.2
./build.sh clean

If yes, then can you share the list of packages installed in the environment? You can get the list by activating the environment and then executing conda list.

Oleg-dM commented 3 years ago

I followed these exact instructions (except that I downloaded the 21.06 sources zip and unzipped it manually) and ran into the same error as described above. The compilation starts with the error below (and the packages are listed below the error).

Any idea where the incompatibility could come from? CUB 1.11 should be OK with CUDA 11.2?

-- Configuring done
-- Generating done
-- Build files have been written to: /home/oleg/Downloads/cuml-branch-21.06/cpp/build
[1/226] Building CUDA object CMakeFiles/cuml++.dir/src/fil/infer.cu.o
FAILED: CMakeFiles/cuml++.dir/src/fil/infer.cu.o
/usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DCUML_CPP_API -DDISABLE_CUSPARSE_DEPRECATED -DDMLC_CORE_USE_CMAKE -DDMLC_USE_CXX11=1 -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -DUSEXOPEN2K8 -DcumlEXPORTS -I../include -I../src -I../src_prims -I/include -I/home/oleg/anaconda3/envs/rapids/include -I_deps/thrust-src -I_deps/thrust-src/dependencies/cub -I_deps/raft-src/cpp/include -isystem=/home/oleg/anaconda3/envs/cuml_dev/include -isystem=/usr/local/cuda/include -isystem=/home/oleg/anaconda3/envs/cuml_dev/include/cumlprims -O3 -DNDEBUG --generate-code=arch=compute_61,code=[sm_61] -Xcompiler=-fPIC --expt-extended-lambda --expt-relaxed-constexpr -Xcompiler=-Wno-deprecated-declarations -Xcompiler=-fopenmp -std=c++17 -MD -MT CMakeFiles/cuml++.dir/src/fil/infer.cu.o -MF CMakeFiles/cuml++.dir/src/fil/infer.cu.o.d -x cu -c ../src/fil/infer.cu -o CMakeFiles/cuml++.dir/src/fil/infer.cu.o
In file included from /home/oleg/anaconda3/envs/rapids/include/thrust/system/cuda/detail/execution_policy.h:33,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/iterator/detail/device_system_tag.h:23,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/iterator/iterator_traits.h:111,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/detail/type_traits/pointer_traits.h:23,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/detail/raw_pointer_cast.h:20,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/detail/raw_reference_cast.h:20,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/detail/functional/actor.h:33,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/detail/functional/placeholder.h:20,
                 from /home/oleg/anaconda3/envs/rapids/include/thrust/functional.h:26,
                 from ../src/fil/infer.cu:20:
/home/oleg/anaconda3/envs/rapids/include/thrust/system/cuda/config.h:78:2: error: #error The version of CUB in your include path is not compatible with this release of Thrust. CUB is now included in the CUDA Toolkit, so you no longer need to use your own checkout of CUB. Define THRUST_IGNORE_CUB_VERSION_CHECK to ignore this.
   78 | #error The version of CUB in your include path is not compatible with this release of Thrust. CUB is now included in the CUDA Toolkit, so you no longer need to use your own checkout of CUB. Define THRUST_IGNORE_CUB_VERSION_CHECK to ignore this.
      |  ^~~~~

conda list output below:

vinaydes commented 3 years ago

Looks like something in the conda setup is messed up on your machine. You seem to have a directory from the rapids environment in your include path, -I/home/oleg/anaconda3/envs/rapids/include, when you are actually using the cuml_dev environment. Note that the header causing the problem is from the rapids environment: /home/oleg/anaconda3/envs/rapids/include/thrust/system/cuda/detail/execution_policy.h:33. Does your PATH variable have some directories from the rapids environment hard-coded? Maybe you can try deleting the rapids environment and doing a fresh build.
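A mixed include path like this can be spotted mechanically by listing which conda environments the compiler command pulls headers from. The following is a rough sketch: `conda_envs_in_includes` is a hypothetical helper, it only understands the -I and -isystem= forms seen in the log above, and the command string is an abbreviated version of the failing nvcc line.

```python
import re
import shlex

# Matches the environment name in paths such as .../envs/<name>/include
ENV_RE = re.compile(r"/envs/([^/]+)/")


def conda_envs_in_includes(command: str) -> dict:
    """Map conda env name -> include dirs from that env in a compiler command."""
    found = {}
    for token in shlex.split(command):
        if token.startswith("-isystem="):
            path = token[len("-isystem="):]
        elif token.startswith("-I"):
            path = token[2:]
        else:
            continue
        match = ENV_RE.search(path + "/")
        if match:
            found.setdefault(match.group(1), []).append(path)
    return found


# Abbreviated from the failing command in the build log.
command = (
    "/usr/local/cuda/bin/nvcc -I../include "
    "-I/home/oleg/anaconda3/envs/rapids/include "
    "-isystem=/home/oleg/anaconda3/envs/cuml_dev/include "
    "-isystem=/usr/local/cuda/include -c ../src/fil/infer.cu"
)
envs = conda_envs_in_includes(command)
print(sorted(envs))
```

More than one environment name in the result suggests that headers from a stale environment are leaking into the build.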

Oleg-dM commented 3 years ago

Managed to compile cuML 21.06 on a fresh Ubuntu install and tested the new backend (split_algo=1) on 2 GTX 1050 -> works like a charm!! (compile ran after deleting env "rapids" but failed later on during compilation).

Thanks a lot @vinaydes and @hcho3 for following-up on this, really appreciated