Open wangao1236 opened 1 year ago
@grgalex
@wangao1236 Thanks for the feedback and also for the question!
In general, for nvshare we want to minimize the amount of non-pageable memory allocations. Currently, we only interpose `cuMemAlloc` and ignore other Driver API functions that also allocate memory and may potentially not use `cuMemAlloc` internally.

I read the CUDA docs on the "Stream Ordered Memory Allocator", which comprises a family of functions. My understanding is that yes, we can interpose (I'll use the Driver API function names here) `cuMemAllocAsync` and `cuMemFreeAsync` and convert them to `cuMemAllocManaged`. This is because the core return entity is a `CUdeviceptr *`, same as for plain `cuMemAlloc`.

A drawback is that by doing this we nullify the "benefits" of asynchronicity, as `cuMemAllocManaged` is synchronous. My best guess is that interposing and converting `cuMemAllocAsync` to `cuMemAllocManaged` and `cuMemFreeAsync` to `cuMemFree` will still give correct calculation results.
I don't have enough free time to implement and test this right now. Do you want to give it a try? We can then discuss your findings and you can finally open a PR with your contribution if all things go well!
> A drawback is that by doing this we nullify the "benefits" of asynchronicity, as `cuMemAllocManaged` is synchronous. My best guess is that interposing and converting `cuMemAllocAsync` to `cuMemAllocManaged` and `cuMemFreeAsync` to `cuMemFree` will still give correct calculation results.
Thank you very much for your response!
Our team has also researched this and reached the same conclusion. As you mentioned, in this situation the only solution is to force the asynchronous operation to be synchronous in order to return a valid `CUdeviceptr *`.
In this case the asynchronous behavior is sacrificed, but from the perspective of memory oversubscription it is still valuable!
> I don't have enough free time to implement and test this right now. Do you want to give it a try? We can then discuss your findings and you can finally open a PR with your contribution if all things go well!
Thanks for the invitation; we would be glad to contribute to this project to make it compatible with more GPU virtualization scenarios.
Hello, I have read your thesis and code, and I think your idea is great! However, I have a question. Since the introduction of the Stream-Ordered Memory Allocator in CUDA 11.2, the `cudaMallocAsync` and `cudaFreeAsync` APIs have been available. If an application calls `cudaMallocAsync` and it is also intercepted and replaced with `cudaMallocManaged`, what impact does that have on the calculation results?