Closed seunghwak closed 4 years ago
We need to replace all the direct calls to RMM_ALLOC and malloc with memory allocations with RAII objects. [...] we should use std::vector, rmm:device_vector and rmm::device_buffer
cuML is also moving and generalizing its device buffer in RAFT. Seems like we could use that too and rely on it for interacting with RMM properly. What do you think?
It depends on what we see in RAFT and RAPIDS policies.
The current approach(https://github.com/rapidsai/cuml/blob/branch-0.11/wiki/cpp/DEVELOPER_GUIDE.md#device-and-host-memory-allocations) is not exception safe (still requires explicit deallocate calls).
If cugraph is also expected to be based on RAFT, then we should try our best to adopt RAFT (in a way we are currently trying our best to adopt cudf & rmm) as much as possible. If not, I think using cudf & rmm containers (resource handles) is good enough.
I just found there is MLCommon::device_buffer and if we are encouraged use raft::device_buffer (which might be a wrapper for rmm::device_buffer), then, we should, but I am a bit hesitant about adding one more abstraction layer unless there is a clear benefit (may be stream management?).
RMM_ALLOC is one source of memory leak on exceptions, but there are more (e.g. malloc, new, and cudaMalloc, and ALLOC_TRY).
And we're using these extensively all over the codebase. Fixing all these will touch possibly over a hundred of files and can lead to many PR merge conflicts.
I think it is dangerous to fix all these in this release.
closed by #897
Describe the bug If we allocate memory through malloc or RMM_ALLOC, this memory block doesn't get automatically freed if exception occurs between memory allocation and memory deallocation. We need to replace all the direct calls to RMM_ALLOC and malloc with memory allocations with RAII objects.
In the planned code restructuring phase, we should use std::vector, rmm:device_vector and rmm::device_buffer (https://github.com/rapidsai/rmm/pull/99, PR submitted, haven't merged, yet) where-ever possible (and at least smart-pointers which automatically release memory on stack unwinding).