rapidsai / cugraph

cuGraph - RAPIDS Graph Analytics Library
https://docs.rapids.ai/api/cugraph/stable/
Apache License 2.0
1.68k stars 298 forks source link

[ENH] Enforce exception safety for memory allocations. #459

Closed seunghwak closed 4 years ago

seunghwak commented 5 years ago

Describe the bug If we allocate memory through malloc or RMM_ALLOC, this memory block doesn't get automatically freed if exception occurs between memory allocation and memory deallocation. We need to replace all the direct calls to RMM_ALLOC and malloc with memory allocations with RAII objects.

In the planned code restructuring phase, we should use std::vector, rmm:device_vector and rmm::device_buffer (https://github.com/rapidsai/rmm/pull/99, PR submitted, haven't merged, yet) where-ever possible (and at least smart-pointers which automatically release memory on stack unwinding).

afender commented 4 years ago

We need to replace all the direct calls to RMM_ALLOC and malloc with memory allocations with RAII objects. [...] we should use std::vector, rmm:device_vector and rmm::device_buffer

cuML is also moving and generalizing its device buffer in RAFT. Seems like we could use that too and rely on it for interacting with RMM properly. What do you think?

seunghwak commented 4 years ago

It depends on what we see in RAFT and RAPIDS policies.

The current approach(https://github.com/rapidsai/cuml/blob/branch-0.11/wiki/cpp/DEVELOPER_GUIDE.md#device-and-host-memory-allocations) is not exception safe (still requires explicit deallocate calls).

If cugraph is also expected to be based on RAFT, then we should try our best to adopt RAFT (in a way we are currently trying our best to adopt cudf & rmm) as much as possible. If not, I think using cudf & rmm containers (resource handles) is good enough.

seunghwak commented 4 years ago

I just found there is MLCommon::device_buffer and if we are encouraged use raft::device_buffer (which might be a wrapper for rmm::device_buffer), then, we should, but I am a bit hesitant about adding one more abstraction layer unless there is a clear benefit (may be stream management?).

seunghwak commented 4 years ago

RMM_ALLOC is one source of memory leak on exceptions, but there are more (e.g. malloc, new, and cudaMalloc, and ALLOC_TRY).

And we're using these extensively all over the codebase. Fixing all these will touch possibly over a hundred of files and can lead to many PR merge conflicts.

I think it is dangerous to fix all these in this release.

BradReesWork commented 4 years ago

closed by #897