grgalex / nvshare

Practical GPU Sharing Without Memory Size Constraints
Apache License 2.0
210 stars · 23 forks

[Q & A] Would intercepting the cudaMallocAsync API also be suitable for this approach? #4

Open wangao1236 opened 1 year ago

wangao1236 commented 1 year ago

Hello, I have read your thesis and code, and I think your idea is great! However, I have a question. Since the introduction of the Stream-Ordered Memory Allocator in CUDA 11.2, the cudaMallocAsync and cudaFreeAsync APIs have been available. If an application calls cudaMallocAsync and it is also intercepted and replaced with cudaMallocManaged, what impact does this have on the calculation results?

wangao1236 commented 1 year ago

@grgalex

grgalex commented 1 year ago

@wangao1236 Thanks for the feedback and also for the question!

In general, for nvshare we want to make sure that we minimize the amount of non-pageable memory allocations. Currently, we only interpose cuMemAlloc and ignore other Driver API functions that also allocate memory and may not use cuMemAlloc internally.
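For context, here is a heavily simplified sketch of what that interposition conceptually looks like. This is illustrative only, not the actual nvshare source; the real code also handles bookkeeping, versioned driver symbols, and how the application resolves the symbol in the first place:

```c
/* Illustrative sketch only -- not the actual nvshare code.
 * The preloaded library shadows cuMemAlloc and backs the allocation
 * with pageable managed memory instead of plain device memory. */
#include <cuda.h>

CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)
{
    /* CU_MEM_ATTACH_GLOBAL: the managed allocation is accessible
     * from any stream on any device. */
    return cuMemAllocManaged(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
}
```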

I read the CUDA docs on the "Stream Ordered Memory Allocator", which comprises a family of functions.

My understanding is that yes, we can interpose (I'll use the Driver API function name here) cuMemAllocAsync and cuMemFreeAsync and convert them to cuMemAllocManaged.

This is because the allocation is returned through a CUdeviceptr *, the same as for plain cuMemAlloc.

A drawback is that by doing this we nullify the "benefits" of asynchronicity, as cuMemAllocManaged is synchronous.

My best guess is that interposing and converting cuMemAllocAsync to cuMemAllocManaged and cuMemFreeAsync to cuMemFree will still give correct calculation results.
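A rough, untested sketch of that conversion follows (again illustrative only; real code would also need to plug into nvshare's existing allocation bookkeeping):

```c
/* Untested sketch of the proposed conversion -- not the actual nvshare code.
 * The stream argument is ignored, so both calls become synchronous. */
#include <cuda.h>

CUresult cuMemAllocAsync(CUdeviceptr *dptr, size_t bytesize, CUstream hStream)
{
    (void)hStream;  /* asynchronicity is sacrificed */
    return cuMemAllocManaged(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
}

CUresult cuMemFreeAsync(CUdeviceptr dptr, CUstream hStream)
{
    (void)hStream;
    return cuMemFree(dptr);
}
```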

I don't have enough free time to implement and test this right now. Do you want to give it a try? We can then discuss your findings and you can finally open a PR with your contribution if all things go well!

wangao1236 commented 1 year ago

> A drawback is that by doing this we nullify the "benefits" of asynchronicity, as cuMemAllocManaged is synchronous.
>
> My best guess is that interposing and converting cuMemAllocAsync to cuMemAllocManaged and cuMemFreeAsync to cuMemFree will still give correct calculation results.

Thank you very much for your response!

Our team has also researched and thought about this. As you mentioned, in this situation the only solution is to forcefully convert the asynchronous operation into a synchronous one in order to return the correct CUdeviceptr.

In this case the asynchronous behavior is sacrificed. However, from the perspective of memory oversubscription, it is still valuable!

wangao1236 commented 1 year ago

> I don't have enough free time to implement and test this right now. Do you want to give it a try? We can then discuss your findings and you can finally open a PR with your contribution if all things go well!

Thanks for the invitation! We would be glad to contribute to this project and help make it compatible with more GPU virtualization scenarios.