VibhuJawa opened this issue 2 years ago
I can reproduce, though the error doesn't always manifest in exactly the same way; it is not always in the initialization of cuBLAS or cuSOLVER. I got the following error:
(ns0113) ➜ python git:(branch-22.02) ✗ python repro.py
CUBLAS call='cublasCreate(&cublas_handle_)' at file=_deps/raft-src/cpp/include/raft/handle.hpp line=87 failed with CUBLAS_STATUS_NOT_INITIALIZED
Describe the bug
I am observing that we use 426 MiB of memory outside the pool when training/using a cuML model. See the MRE below (trace here), where we throw a CUSOLVER_STATUS_INTERNAL_ERROR when we set the pool to a limit near the device's memory limit (15109 MiB in this case). Please note that this works if we set the pool to a smaller value or don't set one at all.
Steps/Code to reproduce bug
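The original repro.py isn't shown in this thread; the following is a minimal sketch of the scenario described, assuming a cuML LinearRegression model and the RMM Python API from the branch-22.02 era (rmm.reinitialize and rmm.rmm_cupy_allocator):

```python
# Minimal sketch (not the original repro.py): create an RMM pool sized
# near the reported 15109 MiB device limit, route CuPy through it, then
# train a cuML model. Per the report, cuBLAS/cuSOLVER handle creation
# then allocates outside the pool and fails with
# CUSOLVER_STATUS_INTERNAL_ERROR / CUBLAS_STATUS_NOT_INITIALIZED.
import cupy as cp
import rmm
from cuml.linear_model import LinearRegression  # model choice is an assumption

# Initial pool size just under the device memory limit (in bytes).
rmm.reinitialize(pool_allocator=True, initial_pool_size=15000 * 1024 * 1024)
cp.cuda.set_allocator(rmm.rmm_cupy_allocator)  # API as of branch-22.02

X = cp.random.rand(10_000, 10, dtype=cp.float32)
y = X.sum(axis=1)
LinearRegression().fit(X, y)  # fails when the pool leaves no headroom
```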
Expected behavior
I would expect these allocations to go through the RMM pool.
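One way to see how much of the observed usage actually flows through RMM, and by implication how much (like the 426 MiB above) is outside the pool, is a statistics adaptor; a minimal sketch, assuming rmm.mr.StatisticsResourceAdaptor is available in the installed RMM:

```python
# Sketch: track RMM-managed allocations; the gap between these numbers
# and what nvidia-smi reports is memory allocated outside the pool
# (e.g. by cuBLAS/cuSOLVER handle creation).
import rmm
import rmm.mr

pool = rmm.mr.PoolMemoryResource(rmm.mr.CudaMemoryResource())
stats = rmm.mr.StatisticsResourceAdaptor(pool)
rmm.mr.set_current_device_resource(stats)

buf = rmm.DeviceBuffer(size=1 << 20)  # 1 MiB through the pool
print(stats.allocation_counts)        # bytes/counts RMM knows about
```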
Additional Context: This seems to be the cause of problems in a dask-sql + dask-ml workflow where the pool grows to the maximum device memory (which is the default behavior), causing problems with ML inference.
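For reference, a sketch of the dask-cuda side of such a workflow (assumed setup, not taken from the original workflow); rmm_pool_size only sets the initial size, and the pool can grow up to total device memory by default, matching the behavior described above:

```python
# Sketch: workers with an RMM pool whose initial size is set via
# rmm_pool_size; without a cap the pool can grow to all of device
# memory, starving later cuBLAS/cuSOLVER handle creation at inference.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(rmm_pool_size="14GB")
client = Client(cluster)
```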
CC: @randerzander