openucx / ucx

Unified Communication X (mailing list - https://elist.ornl.gov/mailman/listinfo/ucx-group)
http://www.openucx.org

UCM/CUDA: NVLink double free and cache not cleaned #4410

Open pentschev opened 4 years ago

pentschev commented 4 years ago

I've been debugging an out-of-memory error in UCX-Py for some time, which I managed to trace down to the CUDA IPC cache. When setting UCX_CUDA_IPC_CACHE=n there's no out-of-memory error anymore, but in some circumstances the application crashes in https://github.com/openucx/ucx/blob/master/src/uct/cuda/cuda_ipc/cuda_ipc_iface.c#L206-L209 due to an attempt to unmap the same memory address multiple times.

There are two issues:

  1. When the cache is enabled, it never seems to clean up, leading to out-of-memory errors;
  2. When the cache is disabled, some memory maps seem to be freed multiple times, causing a fatal error.

For the second item, @MattBBaker suggested this may be due to the RDMA zcopy interface implemented by CUDA IPC, where the same memory region is reused and probably shouldn't be freed multiple times, as discussed in https://github.com/rapidsai/ucx-py/issues/315.

I am now wondering, are the two issues above bugs or somehow am I missing a configuration? If they're bugs, I'm happy to work on a fix, but I would like some guidance on the correct way these should be fixed.

cc @MattBBaker @quasiben

bureddy commented 4 years ago

@pentschev 1) Yes, there is a possibility that the cache may never get cleaned up if there is no conflict in the opened mappings. @Akshay-Venkatesh @ssaulters, maybe we need some kind of limit here (# of entries?)

2) It looks to me like reusing the same address is the issue here. This could happen if there are multiple simultaneous transfers in flight for the same address, assuming the CUDA driver doesn't maintain a reference count for opening/closing a memhandle.

pentschev commented 4 years ago

@pentschev

  1. Yes, there is a possibility that the cache may never get cleaned up if there is no conflict in the opened mappings. @Akshay-Venkatesh @ssaulters, maybe we need some kind of limit here (# of entries?)

That makes sense. I think we do need a limit, and perhaps a safe cleanup path as well, i.e., if a cudaMalloc fails, clean the cache and try again.

  2. It looks to me like reusing the same address is the issue here. This could happen if there are multiple simultaneous transfers in flight for the same address, assuming the CUDA driver doesn't maintain a reference count for opening/closing a memhandle.

I think the driver doesn't maintain a refcount, but I may be mistaken. Regardless, what would you say is the correct solution? Not reusing the same addresses, or something else?

bureddy commented 4 years ago

@pentschev does it make sense to reuse the same address in multiple outstanding transfers? The data won't be consistent, right?

pentschev commented 4 years ago

@pentschev does it make sense to reuse the same address in multiple outstanding transfers? The data won't be consistent, right?

It depends; the design may decide to enqueue transfers and reuse the same mapping. Over PCIe that could be acceptable performance-wise, since we couldn't have multiple transfers in one direction at the same time; I don't know whether that's the case for NVLink, though. But based on how you formulated your comment, I assume you're saying we shouldn't reuse a memory address, which makes sense to me as well, so I'm going to go ahead and try that.