rapidsai / rmm

RAPIDS Memory Manager
https://docs.rapids.ai/api/rmm/stable/
Apache License 2.0

[BUG] System MR causes segfault #1656

Closed rongou closed 1 month ago

rongou commented 1 month ago

Describe the bug

While investigating cuml benchmarks, I found an issue with the current system_memory_resource that causes a segfault. Roughly, it happens in code like this:

void foo(...) {
  rmm::device_uvector<T> tmp(bufferSize, stream);
  // launch cuda kernels making use of tmp
}

When the function returns, tmp goes out of scope and is destroyed while the CUDA kernel may still be in flight. With cudaFree, the CUDA runtime performs an implicit synchronization so that the kernel finishes before the memory is actually freed. With system allocated memory (SAM) we don't have that guarantee, so the kernel can touch memory that has already been released, which is a use-after-free.
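To make the ordering problem concrete, here is a minimal host-only sketch. It does not use CUDA or RMM: `ToyStream`, `ToyBuffer`, and `run_foo` are hypothetical names invented for illustration, and the "kernel" is just a queued callback that records whether the buffer was still alive when it ran. It only models the difference between a deallocation that drains the stream first and one that does not.

```cpp
#include <functional>
#include <memory>
#include <queue>
#include <utility>

// Toy stand-in for a CUDA stream (hypothetical, not a real API):
// work is queued and only runs when synchronize() is called.
struct ToyStream {
  std::queue<std::function<void()>> pending;

  void enqueue(std::function<void()> work) { pending.push(std::move(work)); }

  void synchronize() {
    while (!pending.empty()) {
      auto work = std::move(pending.front());
      pending.pop();
      work();
    }
  }
};

// Toy stand-in for a device buffer (hypothetical). The shared `alive` flag
// lets queued work detect, without touching freed memory, that the buffer
// was destroyed before the work ran.
struct ToyBuffer {
  std::shared_ptr<bool> alive = std::make_shared<bool>(true);
  ToyStream* stream;
  bool sync_on_free;

  ToyBuffer(ToyStream& s, bool sync) : stream(&s), sync_on_free(sync) {}

  ~ToyBuffer() {
    if (sync_on_free) stream->synchronize();  // models an implicitly synchronizing free
    *alive = false;                           // models the memory being released
  }
};

// Mirrors the shape of foo() above. Returns true if the queued "kernel"
// ran while the buffer was still alive, false if it ran after the free.
bool run_foo(bool sync_on_free, bool explicit_sync) {
  ToyStream stream;
  bool kernel_saw_live_buffer = false;
  {
    ToyBuffer tmp(stream, sync_on_free);
    std::shared_ptr<bool> alive = tmp.alive;
    // "Launch a kernel" that uses tmp; it does not run yet.
    stream.enqueue([alive, &kernel_saw_live_buffer] { kernel_saw_live_buffer = *alive; });
    if (explicit_sync) stream.synchronize();  // wait before tmp leaves scope
  }  // tmp is destroyed here
  stream.synchronize();  // drain whatever is still queued
  return kernel_saw_live_buffer;
}
```

In this model, `run_foo(true, false)` returns true (the free waits for the work), `run_foo(false, false)` returns false (the work runs after the free, the analogue of the segfault), and `run_foo(false, true)` returns true. The last case suggests that, as a caller-side stopgap in real code, synchronizing the stream before tmp goes out of scope would avoid the problem, at the cost of stalling the host.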

Steps/Code to reproduce bug

This was discovered by running the Spark RAPIDS ML benchmark with system mr enabled.

Expected behavior

Should not segfault.

Environment details (please complete the following information):