JBlaschke opened this issue 3 years ago
What you describe is a real problem. And given the current CUDA API, it's difficult to do better than what's currently implemented in PyCUDA. The following are the constraints:
The only "bulletproof" solution that comes to mind is to maintain a per-context queue of "memory-yet-to-be-deallocated". I'd be open to considering patches implementing this if they are thoroughly tested. If you are able to migrate to a less stateful API (OpenCL?), this problem would disappear on its own.
If you can ensure that no GPU memory gets freed outside your `GPUTaskWrapper`, then what you have should also work. Another option would be to preallocate all the GPU memory you need and hold onto it until program completion, if that's feasible for your workload.
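For the preallocation route, a minimal untested sketch (sizes and names are placeholders, with `pycuda.autoinit` standing in for whatever context setup you actually use):

```python
# Untested sketch of "preallocate and hold": one big allocation up front,
# kept alive for the whole program, with pieces handed out by offset.
import numpy as np
import pycuda.autoinit             # stands in for your real context setup
import pycuda.driver as cuda

POOL_BYTES = 256 * 1024 * 1024     # size this for your workload
_pool = cuda.mem_alloc(POOL_BYTES) # never freed until the process exits

def upload_at(offset_bytes, host_array):
    """Copy a host array into a slice of the pool; return its device pointer."""
    dst = int(_pool) + offset_bytes
    cuda.memcpy_htod(dst, host_array)
    return dst

x_ptr = upload_at(0, np.arange(1024, dtype=np.float32))
```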
Thanks for the feedback, @inducer. I will look at ways to provide a PR, and I share your frustration with CUDA's context management. In the meantime, @elliottslaughter's solution above works for our needs (for now).
I wonder if this would be simpler to fix in situations where there is only ever one context per GPU (i.e., the way the CUDA runtime API does it with primary contexts). GPU memory could then always be freed via that one-and-only context. Essentially, this simpler model would clean up after users as long as they only use the primary context, and leave the work of freeing memory up to them if they have fancier uses of multiple contexts.
> where there is only ever one context per GPU
But if only one thread can have that context as its active one, how do you propose dealing with multiple threads?
Ok, this is a good limitation to know about.
Our system does use multiple threads, but we typically limit it so that at most one thread is running at a time. (There isn't really a point in allowing more, since the GIL kills any parallel speedup anyway. When we really need parallelism, we use multiple processes.)
If we can promise that exactly one thread will use the context at a time, is it ok to leave the context active in multiple threads? Or is that still bad?
This post seems to indicate that, as of CUDA 4 (i.e. after PyCUDA's context management code was written), it became possible to have one context be current for multiple threads. The docs at https://docs.nvidia.com/cuda/cuda-driver-api/index.html still don't clearly reflect that AFAICT (or maybe I didn't look hard enough). I'd be happy to consider patches (if needed) that allow making use of that in PyCUDA.
For what it's worth, asking around my NVIDIA contacts I was able to dig up a few more links:
- The CUDA runtime API is thread-safe, which means it maintains per-thread state about the current device.
- The base CUDA APIs (driver and runtime) were made thread-safe some time ago.
I am working with a group that is using PyCUDA in a task-based parallel application. We therefore don't have a "main" thread that can "own" the PyCUDA primary context. Instead, we found a solution where we push/pop the primary context on the context stack:
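Roughly, each task does something like this (an illustrative sketch, not the exact code; the task body here is made up):

```python
# Illustrative sketch: every task retains the primary context, pushes it,
# does its GPU work, and pops it again when done.
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

cuda.init()
dev = cuda.Device(0)

def run_gpu_task(host_data):
    ctx = dev.retain_primary_context()
    ctx.push()
    try:
        a = gpuarray.to_gpu(host_data)   # device memory is allocated here
        return (2 * a).get()             # copy the result back to the host
    finally:
        ctx.pop()                        # note: 'a' is still in scope here

result = run_gpu_task(np.arange(16, dtype=np.float32))
```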
The problem with this approach is that any variables referring to GPU memory are still in scope when `ctx.pop()` is called. This appears to result in their not being freed on the device, leaking memory. Explicitly calling `del` on these variables does not stop the memory leak.

Our solution (credit to @elliottslaughter for making this work) has been to create an execution guard, ensuring that all GPU variables are out of scope before `ctx.pop()` is called:
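Roughly (the names here are illustrative, not our exact code):

```python
# Rough sketch of the execution guard: the GPU work runs inside a separate
# function, so every reference to device memory is dropped while the context
# is still current, before the wrapper pops it.
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray

cuda.init()
dev = cuda.Device(0)

def gpu_task_wrapper(task, *args, **kwargs):
    ctx = dev.retain_primary_context()
    ctx.push()
    try:
        # All gpuarrays live only inside `task`, so they are deallocated
        # (with the context still active) before ctx.pop() below runs.
        return task(*args, **kwargs)
    finally:
        ctx.pop()

def double_on_gpu(host_data):
    a = gpuarray.to_gpu(host_data)
    return (2 * a).get()   # only the host-side result escapes the task

result = gpu_task_wrapper(double_on_gpu, np.arange(16, dtype=np.float32))
```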