rapidsai / cuml

cuML - RAPIDS Machine Learning Library
https://docs.rapids.ai/api/cuml/stable/
Apache License 2.0

[BUG] cuML using memory outside of RMM Pool #4485

Open VibhuJawa opened 2 years ago

VibhuJawa commented 2 years ago

Describe the bug

I am observing that we use 426 MiB of memory outside the pool when training or using a cuML model.
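For anyone who wants to check a number like this on their own machine, here is a minimal sketch of one way to estimate it. It is an assumption, not necessarily how the figure above was obtained: it compares the device-wide "used" memory reported by NVML (through the `pynvml` package) against the size reserved for the RMM pool. The helper names `outside_pool_mib` and `query_device_used_bytes` are made up for illustration.

```python
def outside_pool_mib(device_used_bytes: int, pool_bytes: int) -> float:
    """Device memory in use beyond the RMM pool reservation, in MiB."""
    return (device_used_bytes - pool_bytes) / 2**20


def query_device_used_bytes(device_index: int = 0) -> int:
    """Bytes currently in use on the device according to NVML (needs a GPU)."""
    import pynvml  # imported lazily so the helper above works without a GPU

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).used
    finally:
        pynvml.nvmlShutdown()


# Example with made-up readings: a 13 GiB pool and 426 MiB used on top of it.
print(outside_pool_mib(13 * 2**30 + 426 * 2**20, 13 * 2**30))
```

Note that NVML reports usage for the whole device, so the result also includes the CUDA context and any other process on the GPU. Treat it as an upper bound on what a single library allocates outside the pool.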

See the MRE and trace below, where we throw a CUSOLVER_STATUS_INTERNAL_ERROR when we set the pool to a limit near the device's memory limit (15109 MiB in this case). Note that this works if we set the pool to a smaller value or don't set one at all.
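To make the units concrete, the following sketch works through the arithmetic for the pool sizes used in the reproducer (13.5*(2**30) and 12.5*(2**30) bytes) against the 15109 MiB device. It only does unit conversion; the variable names are mine.

```python
GIB = 2**30
MIB = 2**20

device_mib = 15109                     # device memory limit quoted above

failing_pool_bytes = int(13.5 * GIB)   # pool size that triggers the error
working_pool_bytes = int(12.5 * GIB)   # pool size that works

failing_pool_mib = failing_pool_bytes / MIB
working_pool_mib = working_pool_bytes / MIB

# Memory left on the device once the pool has been reserved
failing_headroom_mib = device_mib - failing_pool_mib
working_headroom_mib = device_mib - working_pool_mib

print(f"failing pool: {failing_pool_bytes} bytes = {failing_pool_mib} MiB "
      f"= {failing_pool_bytes / 1e9:.3f} GB, headroom {failing_headroom_mib} MiB")
print(f"working pool: {working_pool_bytes} bytes = {working_pool_mib} MiB, "
      f"headroom {working_headroom_mib} MiB")
```

So the failing configuration leaves 1285 MiB outside the pool and the working one leaves 2309 MiB. Everything that does not allocate through RMM, including the CUDA context, has to fit in that remainder.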

Steps/Code to reproduce bug

from cuml.linear_model import LinearRegression
import cudf
import rmm

# Fails when pool >= 13.5 GiB (13.5 * 2**30 bytes, ~14.495 GB)
# Works with pool = 12.5 * 2**30 bytes
rmm.reinitialize(pool_allocator=True, initial_pool_size=int(13.5 * 2**30))

X = cudf.DataFrame({'c_1':[1.01]*40_000,
                    'c_2':[10.01]*40_000})
y = cudf.Series([6.0]*40_000)

model = LinearRegression(fit_intercept=True)
model = model.fit(X,y)
```python
RuntimeError: cuSOLVER error encountered at: file=_deps/raft-src/cpp/include/raft/linalg/cusolver_wrappers.h line=1405: call='cusolverDnSetStream(handle, stream)', Reason=7:CUSOLVER_STATUS_INTERNAL_ERROR
Obtained 64 stack frames
#0 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft9exception18collect_call_stackEv+0x3b) [0x7f7fc7caff3b]
#1 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft14cusolver_errorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xbd) [0x7f7fc7dbf74d]
#2 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN4raft6linalg5eigDCIdEEvRKNS_8handle_tEPKT_mmPS5_S8_P11CUstream_st+0xf41) [0x7f7fc7ed7061]
#3 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN8MLCommon6LinAlg8lstsqEigIdEEvRKN4raft8handle_tEPKT_iiS8_PS6_P11CUstream_st+0x543) [0x7f7fc7f0e8d3]
#4 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitIdEEvRKN4raft8handle_tEPT_iiS7_S7_S7_bbP11CUstream_sti+0x1e7) [0x7f7fc7f0f577]
#5 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/common/../../../../libcuml++.so(_ZN2ML3GLM6olsFitERKN4raft8handle_tEPdiiS5_S5_S5_bbi+0x24) [0x7f7fc7e7de14]
#6 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/site-packages/cuml/linear_model/linear_regression.cpython-38-x86_64-linux-gnu.so(+0x2a3d2) [0x7f82fb5803d2]
#7 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyObject_Call+0x24d) [0x55f179d6935d]
#8 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55f179e124ef]
#9 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#10 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]
#11 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x4e03) [0x55f179e15133]
#12 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#13 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyEval_EvalCodeEx+0x39) [0x55f179df3e19]
#14 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyEval_EvalCode+0x1b) [0x55f179e9624b]
#15 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x27318e) [0x55f179eb718e]
#16 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x128e0b) [0x55f179d6ce0b]
#17 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55f179e10c77]
#18 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#19 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#20 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#21 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#22 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#23 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1951f9) [0x55f179dd91f9]
#24 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#25 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#26 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x947) [0x55f179e10c77]
#27 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#28 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#29 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#30 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x378) [0x55f179df4198]
#31 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b0841) [0x55f179df4841]
#32 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyObject_Call+0x5e) [0x55f179d6916e]
#33 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x21bf) [0x55f179e124ef]
#34 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#35 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]
#36 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x181e) [0x55f179e11b4e]
#37 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#38 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#39 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#40 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#41 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#42 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#43 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#44 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x1d9f) [0x55f179e120cf]
#45 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1806f3) [0x55f179dc46f3]
#46 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/lib/python3.8/lib-dynload/_asyncio.cpython-38-x86_64-linux-gnu.so(+0xa886) [0x7f8367771886]
#47 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyObject_MakeTpCall+0x31e) [0x55f179d7f30e]
#48 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x21beaf) [0x55f179e5feaf]
#49 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x129082) [0x55f179d6d082]
#50 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(PyVectorcall_Call+0x6e) [0x55f179d6fe4e]
#51 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0x5f25) [0x55f179e16255]
#52 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#53 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#54 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#55 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#56 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#57 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#58 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#59 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#60 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyFunction_Vectorcall+0x1a6) [0x55f179df3fc6]
#61 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalFrameDefault+0xa63) [0x55f179e10d93]
#62 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(_PyEval_EvalCodeWithName+0x2c3) [0x55f179df2db3]
#63 in /datasets/vjawa/miniconda3/envs/rapids-22.02-dask-sql/bin/python(+0x1b08b7) [0x55f179df48b7]
```

Expected behavior

I would expect us to use the RMM pool for these allocations.

Additional Context: This seems to be the cause of problems in a dask-sql + dask-ml workflow, where the pool grows to the maximum device memory (which is the default behavior), causing problems with the ML inference.
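Until allocations like these go through the pool, one possible workaround is to cap the pool so that some device memory stays free for everything allocated outside RMM. The sketch below is only an illustration under stated assumptions: `capped_pool_bytes` and `reinitialize_with_headroom` are hypothetical helper names, the 2 GiB headroom is a guess informed by the working 12.5 GiB case above rather than a measured requirement, and rounding down to a multiple of 256 bytes is a conservative alignment choice. `initial_pool_size` and `maximum_pool_size` are existing keyword arguments of `rmm.reinitialize`.

```python
def capped_pool_bytes(total_bytes: int, headroom_bytes: int, alignment: int = 256) -> int:
    """Largest pool size that leaves `headroom_bytes` free, rounded down to `alignment`."""
    if headroom_bytes >= total_bytes:
        raise ValueError("headroom must be smaller than total device memory")
    usable = total_bytes - headroom_bytes
    return usable - usable % alignment


def reinitialize_with_headroom(total_bytes: int, headroom_bytes: int,
                               initial_bytes: int = 2**30) -> int:
    """Re-create the RMM pool with a hard cap below the device limit (needs a GPU)."""
    import rmm  # imported lazily so the helper above can be used without RMM

    cap = capped_pool_bytes(total_bytes, headroom_bytes)
    rmm.reinitialize(
        pool_allocator=True,
        initial_pool_size=min(initial_bytes, cap),
        maximum_pool_size=cap,  # the pool may grow, but never past this cap
    )
    return cap


# Hypothetical numbers: the 15109 MiB device from this issue, 2 GiB of headroom.
print(capped_pool_bytes(15109 * 2**20, 2 * 2**30))
```

With those numbers the cap comes out just under 12.76 GiB, close to the 12.5 GiB setting that is reported to work above. The right headroom likely depends on which libraries are loaded, so it is worth measuring rather than assuming.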

CC: @randerzander

dantegd commented 2 years ago

I can reproduce this, though the error doesn't always manifest in exactly the same way: it is not always in the initialization of cuBLAS or cuSOLVER. I got the following error:

(ns0113) ➜  python git:(branch-22.02) ✗ python repro.py
CUBLAS call='cublasCreate(&cublas_handle_)' at file=_deps/raft-src/cpp/include/raft/handle.hpp line=87 failed with CUBLAS_STATUS_NOT_INITIALIZED

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] commented 2 years ago

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.