JBlaschke opened this issue 3 years ago
What you describe is a real problem. And given the current CUDA API, it's difficult to do better than what's currently implemented in PyCUDA. The following are the constraints:
The only "bulletproof" solution that comes to mind is to maintain a per-context queue of "memory-yet-to-be-deallocated". I'd be open to considering patches implementing this if they are thoroughly tested. If you are able to migrate to a less stateful API (OpenCL?), this problem would disappear on its own.
If you can ensure that no GPU memory gets freed outside your `GPUTaskWrapper`, then what you have should also work. Another option would be to preallocate all the GPU memory you need and hold onto it until program completion, if that's feasible for your workload.
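For the preallocation route, a minimal untested sketch (sizes and names are placeholders, with `pycuda.autoinit` standing in for whatever context setup you actually use):

```python
# Untested sketch of "preallocate and hold": one big allocation up front,
# kept alive for the whole program, with pieces handed out by offset.
import numpy as np
import pycuda.autoinit             # stands in for your real context setup
import pycuda.driver as cuda

POOL_BYTES = 256 * 1024 * 1024     # size this for your workload
_pool = cuda.mem_alloc(POOL_BYTES) # never freed until the process exits

def upload_at(offset_bytes, host_array):
    """Copy a host array into a slice of the pool; return its device pointer."""
    dst = int(_pool) + offset_bytes
    cuda.memcpy_htod(dst, host_array)
    return dst

x_ptr = upload_at(0, np.arange(1024, dtype=np.float32))
```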
Thanks for the feedback, @inducer. I will look at ways to provide a PR, and I share your frustration with CUDA's context management. In the meantime, @elliottslaughter's solution above works for our needs (for now).
I wonder if this would be simpler to fix in situations where there is only ever one context per GPU (i.e., the way the CUDA runtime API does it with primary contexts). GPU memory could then always be freed via that one-and-only context. Essentially, this simpler model would clean up after users as long as they only use the primary context, and leave the work of freeing memory up to them if they have fancier uses of multiple contexts.
> where there is only ever one context per GPU
But if only one thread can have that context as its active one, how do you propose dealing with multiple threads?
Ok, this is a good limitation to know about.
Our system does use multiple threads, but we typically limit it so that at most one thread is running at a time. (There isn't really a point in allowing more, since the GIL kills any parallel speedup anyway. When we really need parallelism, we use multiple processes.)
If we can promise that exactly one thread will use the context at a time, is it ok to leave the context active in multiple threads? Or is that still bad?
This post seems to indicate that, as of CUDA 4 (i.e. after PyCUDA's context management code was written), it became possible to have one context be current for multiple threads. The docs at https://docs.nvidia.com/cuda/cuda-driver-api/index.html still don't clearly reflect that AFAICT (or maybe I didn't look hard enough). I'd be happy to consider patches (if needed) that allow making use of that in PyCUDA.
For what it's worth, asking around my NVIDIA contacts I was able to dig up a few more links:
- The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device.
- The base CUDA APIs (driver and runtime) were made thread-safe some time ago.
I am working with a group that is using PyCUDA in a task-based parallel application. We therefore don't have a "main" thread that can "own" the PyCUDA primary context. Instead, we found a solution where we push/pop the primary context on the context stack:
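Roughly, each task does something like this (an illustrative sketch, not the exact code; the task body here is made up):

```python
# Illustrative sketch: every task retains the primary context, pushes it,
# does its GPU work, and pops it again when done.
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

cuda.init()
dev = cuda.Device(0)

def run_gpu_task(host_data):
    ctx = dev.retain_primary_context()
    ctx.push()
    try:
        a = gpuarray.to_gpu(host_data)   # device memory is allocated here
        return (2 * a).get()             # copy the result back to the host
    finally:
        ctx.pop()                        # note: 'a' is still in scope here

result = run_gpu_task(np.arange(16, dtype=np.float32))
```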
The problem with this approach is that any variables referring to GPU memory are still in scope when `ctx.pop()` is called. This appears to result in their not being freed on the device, leaking memory. Explicitly calling `del` on these variables does not stop the memory leak.

Our solution (credit to @elliottslaughter for making this work) has been to create an execution guard, ensuring that all GPU variables are out of scope before `ctx.pop()` is called:
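Roughly (the names here are illustrative, not our exact code):

```python
# Rough sketch of the execution guard: the GPU work runs inside a separate
# function, so every reference to device memory is dropped while the context
# is still current, before the wrapper pops it.
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

cuda.init()
dev = cuda.Device(0)

def gpu_task_wrapper(task, *args, **kwargs):
    ctx = dev.retain_primary_context()
    ctx.push()
    try:
        # All gpuarrays live only inside `task`, so they are deallocated
        # (with the context still active) before ctx.pop() below runs.
        return task(*args, **kwargs)
    finally:
        ctx.pop()

def double_on_gpu(host_data):
    a = gpuarray.to_gpu(host_data)
    return (2 * a).get()   # only the host-side result escapes the task

result = gpu_task_wrapper(double_on_gpu, np.arange(16, dtype=np.float32))
```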