rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[QST] Is there a size limit to the input data for the RandomForestClassifier's fit function? #4132

Closed tkpudgy closed 1 year ago

tkpudgy commented 3 years ago

Hello,

Novice here, experimenting with the random forest classifier. I'm following a simple example that I found on the web; the code is shown below. As I increase the number of rows in the input data, I'm noticing that there is an upper limit beyond which RandomForestClassifier.fit(X,y) throws a fit (example below). In the particular environment I'm working in, a row count of 4862 or less works, but anything higher errors out. Any information on why this happens, and a possible workaround, would be very much appreciated!

Code

    import numpy as np
    from cuml.ensemble import RandomForestClassifier as cuRFC
    num = 5000
    X = np.random.normal(size=(num,4)).astype(np.float32)
    y = np.asarray([0,1]*int(num/2), dtype=np.int32)
    cu_rf_params = {'n_estimators': 16, 'max_depth': 6, 'n_bins': 2}
    cuml_model = cuRFC(**cu_rf_params)
    cuml_model.fit(X,y)
    cuml_predict = cuml_model.predict(X)

Error (Note that the error looked different yesterday, something to do with sorting)


    RuntimeError                              Traceback (most recent call last)
    /tmp/ipykernel_64207/2590491031.py in <module>
          6 cu_rf_params = {'n_estimators': 16, 'max_depth': 6, 'n_bins': 2}
          7 cuml_model = cuRFC(**cu_rf_params)
    ----> 8 cuml_model.fit(X,y)
          9 cuml_predict = cuml_model.predict(X)
         10

    ~/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/internals/api_decorators.py in inner_with_setters(*args, **kwargs)
        407 target_val=target_val)
        408
    --> 409 return func(*args, **kwargs)
        410
        411 @wraps(func)

    cuml/ensemble/randomforestclassifier.pyx in cuml.ensemble.randomforestclassifier.RandomForestClassifier.fit()

    RuntimeError: CUDA error encountered at: file=../src/decisiontree/quantile/quantile.cuh line=234: call='cub::DeviceRadixSort::SortKeys( (void *)d_temp_storage->data(), temp_storage_bytes, &data[col_offset], single_column_sorted->data(), n_rows, 0, 8 * sizeof(T), stream)', Reason=cudaErrorInvalidValue:invalid argument
    Obtained 64 stack frames

0 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x4e) [0x7fdcc4f5318e]

1 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x6a) [0x7fdcc4f5395a]

2 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML12DecisionTree16computeQuantilesIfEEvPT_iPKS2_iiSt10shared_ptrIN4raft2mr6device9allocatorEEP11CUstream_st+0x71e) [0x7fdcc53f4a8e]

3 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML12rfClassifierIfE3fitERKN4raft8handle_tEPKfiiPiiRPNS_20RandomForestMetaDataIfiEE+0xf84) [0x7fdcc5403f64]

4 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3fitERKN4raft8handle_tERPNS_20RandomForestMetaDataIfiEEPfiiPiiNS_9RF_paramsEi+0x2b4) [0x7fdcc53d59d4]

5 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/site-packages/cuml/ensemble/randomforestclassifier.cpython-38-x86_64-linux-gnu.so(+0x37986) [0x7fdca7ffa986]

6 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(PyObject_Call+0x24d) [0x55d3c135fd5d]

7 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55d3c140f84f]

8 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55d3c13f4433]

9 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x1b7f47) [0x55d3c13f5f47]

10 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x4d33) [0x55d3c14123c3]

11 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55d3c13f4433]

12 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(PyEval_EvalCodeEx+0x39) [0x55d3c13f5499]

13 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(PyEval_EvalCode+0x1b) [0x55d3c1490ecb]

14 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x273c4e) [0x55d3c14b1c4e]

15 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x12488b) [0x55d3c136288b]

16 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55d3c140dfd7]

17 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

18 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

19 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

20 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

21 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

22 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x190569) [0x55d3c13ce569]

23 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

24 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

25 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55d3c140dfd7]

26 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

27 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

28 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55d3c13f4433]

29 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x378) [0x55d3c13f5818]

30 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x1b7ed1) [0x55d3c13f5ed1]

31 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(PyObject_Call+0x5e) [0x55d3c135fb6e]

32 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55d3c140f84f]

33 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55d3c13f4433]

34 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x1b7f47) [0x55d3c13f5f47]

35 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1822) [0x55d3c140eeb2]

36 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

37 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

38 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

39 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

40 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

41 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

42 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

43 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x1d9d) [0x55d3c140f42d]

44 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x17ffc3) [0x55d3c13bdfc3]

45 in /home/jeon2/miniconda/envs/rapids-21.06/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0xa8a6) [0x7fdf52ba78a6]

46 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyObject_MakeTpCall+0x31e) [0x55d3c1370ebe]

47 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x21adef) [0x55d3c1458def]

48 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x124b02) [0x55d3c1362b02]

49 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(PyVectorcall_Call+0x6e) [0x55d3c136d81e]

50 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x5c4a) [0x55d3c14132da]

51 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

52 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

53 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

54 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

55 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

56 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

57 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

58 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

59 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55d3c13f5646]

60 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55d3c140e0f3]

61 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55d3c13f4433]

62 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(+0x1b7f47) [0x55d3c13f5f47]

63 in /home/jeon2/miniconda/envs/rapids-21.06/bin/python(_PyEval_EvalFrameDefault+0x4d33) [0x55d3c14123c3]

vinaydes commented 3 years ago

There should not be any size limit as such. Your issue looks like a duplicate of #3948. Can you provide some details about your environment? For example: which version of the NVIDIA driver is installed on the system (you can find that by running nvidia-smi), which GPU you are running this sample on, and how you installed cuML (conda, from source, etc.). If you feel your issue is similar to #3948, you can try the workaround I mentioned in that thread. It's basically to build cuML from source instead of installing it via conda.
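The environment details requested above could be gathered in one go with something like the following. This is only a sketch: the helper name `collect_env_report` is made up for illustration, and it assumes the package is registered under the distribution name `cuml` (which may not hold for every install method). It reports `None` for anything it cannot find rather than failing.

```python
import importlib.metadata
import os
import shutil
import subprocess


def collect_env_report():
    """Gather driver/GPU info, cuML version, and conda prefix (None if unknown)."""
    report = {"driver_and_gpus": None, "cuml_version": None, "conda_prefix": None}

    # nvidia-smi ships with the NVIDIA driver; skip the query if it is not on PATH.
    if shutil.which("nvidia-smi"):
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=False,
        )
        if result.returncode == 0:
            report["driver_and_gpus"] = result.stdout.strip().splitlines()

    # Assumption: the distribution is named "cuml"; other names are not checked.
    try:
        report["cuml_version"] = importlib.metadata.version("cuml")
    except importlib.metadata.PackageNotFoundError:
        pass

    # CONDA_PREFIX is set inside an activated conda environment.
    report["conda_prefix"] = os.environ.get("CONDA_PREFIX")
    return report


if __name__ == "__main__":
    for key, value in collect_env_report().items():
        print(f"{key}: {value}")
```

Pasting the printed lines into the issue covers the driver, GPU, and version questions; whether cuML came from conda or a source build still has to be stated by hand.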

tkpudgy commented 3 years ago

Thank you so much for the reference to a related post. It sounds very much like the same problem. From what I can understand from the post, this was an installation-related issue. I will try it on Monday and provide an update. Thank you! Below is the output of nvidia-smi:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 465.19.01    Driver Version: 465.19.01    CUDA Version: 11.3     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA TITAN X ...  Off  | 00000000:02:00.0 Off |                  N/A |
    | 23%   31C    P2    51W / 250W |   1161MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA TITAN X ...  Off  | 00000000:03:00.0 Off |                  N/A |
    | 23%   26C    P0    52W / 250W |      2MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   2  NVIDIA TITAN X ...  Off  | 00000000:81:00.0 Off |                  N/A |
    | 23%   25C    P0    51W / 250W |      2MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+
    |   3  NVIDIA TITAN X ...  Off  | 00000000:82:00.0 Off |                  N/A |
    | 23%   26C    P0    52W / 250W |      2MiB / 12196MiB |      0%      Default |
    |                               |                      |                  N/A |
    +-------------------------------+----------------------+----------------------+

    +-----------------------------------------------------------------------------+
    | Processes:                                                                  |
    |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
    |        ID   ID                                                   Usage      |
    |=============================================================================|
    |    0   N/A  N/A     63268      C   ...s/actici-hf-44/bin/python     1159MiB |
    +-----------------------------------------------------------------------------+

vinaydes commented 3 years ago

Well, the solution I suggested in issue #3948 was to compile cuML from source instead of using the binaries from the conda channel. Detailed instructions for building from source are in this comment. The user then had some environment issues, which they solved with a fresh install. So I suggest the same for you: try building cuML from source and use that instead of the conda channel.

tkpudgy commented 3 years ago

@vinaydes, here is a quick update. I don't have 'sudo' privileges on the system, so it was a big hurdle to get cuML installed from source. However, even in this newly created environment, I'm still facing the same CUDA error as above. I don't have any path forward at the moment, so I'm putting this on hold. Any more information on this would be greatly appreciated. Thank you once again for your help!

vinaydes commented 3 years ago

@tkpudgy Thanks for trying the workaround. I'll try to reproduce it one more time and see if I can debug it in my local environment.

tkpudgy commented 3 years ago

Adding one more detail. Given the limit on the row count for the .fit() function, it's interesting that there doesn't seem to be any such limit on the .predict() call that follows in my code. I've tested up to 50,000 rows and .predict() works without any problems.
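For anyone wanting to pin down a row-count threshold like the 4862 reported earlier, a bisection over the row count is one option. The sketch below is an assumption-laden illustration, not something from this thread: `largest_passing_rows` and `make_cuml_probe` are hypothetical helper names, the search assumes that once a size fails every larger size fails too, and the probe assumes the process stays usable after the CUDA error (if it does not, each probe would need a fresh process).

```python
def largest_passing_rows(fits, lo, hi):
    """Return the largest n in [lo, hi) for which fits(n) is True.

    Assumes fits(lo) is True, fits(hi) is False, and failure is monotonic in n.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fits(mid):
            lo = mid  # mid works, so the threshold is at or above mid
        else:
            hi = mid  # mid fails, so the threshold is below mid
    return lo


def make_cuml_probe():
    """Build a probe that reports whether fit() succeeds on n random rows."""
    import numpy as np
    from cuml.ensemble import RandomForestClassifier as cuRFC

    def probe(n):
        X = np.random.normal(size=(n, 4)).astype(np.float32)
        y = (np.arange(n) % 2).astype(np.int32)  # alternating 0/1 labels
        try:
            cuRFC(n_estimators=16, max_depth=6, n_bins=2).fit(X, y)
            return True
        except RuntimeError:
            return False

    return probe
```

On an affected machine this would be called as `largest_passing_rows(make_cuml_probe(), 1000, 5000)`; the search logic can be checked without a GPU by passing a stand-in predicate such as `lambda n: n <= 4862`.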

venkywonka commented 3 years ago

Hey @tkpudgy, apologies for the delayed response. For some reason, this error occurs only on branch-21.06 and vanishes in later builds (branch-21.08 and nightly). I was able to reproduce your error in a branch-21.06 RAPIDS docker container using the following steps:

Below is how I got the docker container running (it has cuml installed, so there is no need to build from source):

```
docker pull rapidsai/rapidsai-core:21.06-cuda11.2-runtime-ubuntu18.04-py3.8
docker run --gpus all --rm -it -p 8888:8888 -p 8787:8787 -p 8786:8786 \
    rapidsai/rapidsai-core:21.06-cuda11.2-runtime-ubuntu18.04-py3.8
```

After spinning up the container, I just ran the python script you had posted in the description of this issue:

```
python test.py
```

Ran into the same error:

```
(rapids) root@604aa6a2f36a:/rapids# python test.py
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    cuml_model.fit(X,y)
  File "/opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/ensemble/randomforestclassifier.pyx", line 522, in cuml.ensemble.randomforestclassifier.RandomForestClassifier.fit
RuntimeError: CUDA error encountered at: file=../src/decisiontree/quantile/quantile.cuh line=234: call='cub::DeviceRadixSort::SortKeys( (void *)d_temp_storage->data(), temp_storage_bytes, &data[col_offset], single_column_sorted->data(), n_rows, 0, 8 * sizeof(T), stream)', Reason=cudaErrorInvalidValue:invalid argument
Obtained 22 stack frames
#0 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x4e) [0x7f8686a6a18e]
#1 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft10cuda_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0x6a) [0x7f8686a6a95a]
#2 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML12DecisionTree16computeQuantilesIfEEvPT_iPKS2_iiSt10shared_ptrIN4raft2mr6device9allocatorEEP11CUstream_st+0x71e) [0x7f8686f0ba8e]
#3 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML12rfClassifierIfE3fitERKN4raft8handle_tEPKfiiPiiRPNS_20RandomForestMetaDataIfiEE+0xf84) [0x7f8686f1af64]
#4 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3fitERKN4raft8handle_tERPNS_20RandomForestMetaDataIfiEEPfiiPiiNS_9RF_paramsEi+0x2b4) [0x7f8686eec9d4]
#5 in /opt/conda/envs/rapids/lib/python3.8/site-packages/cuml/ensemble/randomforestclassifier.cpython-38-x86_64-linux-gnu.so(+0x37986) [0x7f86cc76d986]
#6 in python(PyObject_Call+0x24d) [0x55859c1ebd5d]
#7 in python(_PyEval_EvalFrameDefault+0x21bf) [0x55859c29b84f]
#8 in python(_PyEval_EvalCodeWithName+0x2c3) [0x55859c280433]
#9 in python(+0x1b7f47) [0x55859c281f47]
#10 in python(_PyEval_EvalFrameDefault+0x4d33) [0x55859c29e3c3]
#11 in python(_PyEval_EvalCodeWithName+0x2c3) [0x55859c280433]
#12 in python(PyEval_EvalCodeEx+0x39) [0x55859c281499]
#13 in python(PyEval_EvalCode+0x1b) [0x55859c31cecb]
#14 in python(+0x252f63) [0x55859c31cf63]
#15 in python(+0x26f033) [0x55859c339033]
#16 in python(+0x274022) [0x55859c33e022]
#17 in python(PyRun_SimpleFileExFlags+0x1b2) [0x55859c33e202]
#18 in python(Py_RunMain+0x36d) [0x55859c33e77d]
#19 in python(Py_BytesMain+0x39) [0x55859c33e939]
#20 in /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7) [0x7f87da3cbbf7]
#21 in python(+0x1e8f39) [0x55859c2b2f39]
```

But when I did the same for branch-21.08 (just replace occurrences of 21.06 with 21.08 in the above docker commands), the error disappeared. Hopefully using cuml's stable public version 21.08 will solve your problem ✌🏻
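A quick way to tell whether an installed build predates 21.08 is to compare the leading components of its version string. This is a rough sketch: the helper names are invented for illustration, and only 21.06 (failing) and 21.08/nightly (working) were actually tested in this thread, so treating every earlier release as affected is an assumption.

```python
import re


def release_tuple(version):
    """Extract (year, month) from a RAPIDS-style version such as '21.06.00'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def predates_fixed_release(version, fixed=(21, 8)):
    """True if the version sorts before the release where the error vanished."""
    return release_tuple(version) < fixed


if __name__ == "__main__":
    # Requires cuml to be installed; cuml.__version__ holds the version string.
    import cuml

    if predates_fixed_release(cuml.__version__):
        print(f"cuml {cuml.__version__} predates 21.08; consider upgrading")
    else:
        print(f"cuml {cuml.__version__} is 21.08 or newer")
```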

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

vinaydes commented 1 year ago

This is an old issue and not reproducible, closing.