Describe the issue
CUDA builds an execution pipeline on the Device (that is, where the GPU runs). For efficiency, it is best to keep operations on the Device rather than on the Host whenever possible. The CUDA API provides explicit copies of data from Host to Device, from Device to Host, and from Device to Device via cudaMemcpy() and its variants. These functions are widely used within onnxruntime and even more so in onnxruntime-genai.
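For reference, a minimal sketch of those three copy directions using the standard CUDA runtime calls (the buffers and sizes are made up for illustration):

```cpp
// Minimal sketch of the explicit copy directions in the CUDA runtime API.
// Error handling omitted for brevity.
#include <cuda_runtime.h>
#include <vector>

void copy_directions_example(size_t n) {
    std::vector<float> host_src(n, 1.0f), host_dst(n);
    float *dev_a = nullptr, *dev_b = nullptr;
    cudaMalloc(&dev_a, n * sizeof(float));
    cudaMalloc(&dev_b, n * sizeof(float));

    // Host -> Device
    cudaMemcpy(dev_a, host_src.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    // Device -> Device (stays entirely in GPU memory)
    cudaMemcpy(dev_b, dev_a, n * sizeof(float), cudaMemcpyDeviceToDevice);
    // Device -> Host
    cudaMemcpy(host_dst.data(), dev_b, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
}
```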
The performance bottleneck of generative LLM models is the large KV-cache. With onnxruntime, even on the CUDA execution provider, the output KV-cache has to be copied to the Host, reordered according to the beam indices, and then fed back to the model as input for generating the next token. In contrast, onnxruntime-genai has a dedicated function that performs all of those operations in Device memory, using cudaMemcpyAsync. To my understanding, this is the root cause of onnxruntime-genai's much better performance compared to onnxruntime.
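To make the distinction concrete, here is a hedged sketch of what a Device-side beam reorder amounts to; the `[num_beams, block_size]` layout, buffer names, and stream handling are assumptions for illustration, not the actual onnxruntime-genai code:

```cpp
// Hedged sketch of reordering a KV-cache along the beam dimension entirely in
// Device memory, using asynchronous Device-to-Device copies.
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

void reorder_kv_cache_on_device(const float* src_cache,  // Device pointer, [num_beams, block_size]
                                float* dst_cache,        // Device pointer, same shape
                                const std::vector<int32_t>& beam_indices,
                                size_t block_size,
                                cudaStream_t stream) {
    // One Device-to-Device copy per selected beam; nothing is staged on the Host.
    for (size_t dst_beam = 0; dst_beam < beam_indices.size(); ++dst_beam) {
        const size_t src_beam = static_cast<size_t>(beam_indices[dst_beam]);
        cudaMemcpyAsync(dst_cache + dst_beam * block_size,
                        src_cache + src_beam * block_size,
                        block_size * sizeof(float),
                        cudaMemcpyDeviceToDevice,
                        stream);
    }
    // The copies are only queued on the stream; the next-token inference can be
    // enqueued on the same stream without a Host synchronization.
}
```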
The CUDA Toolkit is available to onnxruntime when it is compiled with the "--use_cuda" flag. However, an application built against onnxruntime with CUDA has no access to cudaMemcpy, because the onnxruntime API does not expose it. The only way for such an application to achieve performance comparable to onnxruntime-genai's is to compile it against both onnxruntime with CUDA and the CUDA Toolkit.
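Under that assumption, an application linking against both could keep the KV-cache on the Device roughly like this; the tensor names, shapes, and simplified single-output handling are hypothetical and only meant to show the idea of combining the onnxruntime IoBinding API with direct CUDA Toolkit calls:

```cpp
// Hedged sketch, not the actual onnxruntime-genai implementation: the Device
// buffer is allocated with the CUDA Toolkit, handed to onnxruntime via
// IoBinding, and the output KV-cache is copied Device-to-Device between steps.
// Tensor names and shapes are hypothetical; the model's other inputs
// (input_ids, attention_mask, ...) are omitted for brevity.
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>
#include <vector>

void generation_step_with_device_kv(Ort::Session& session, cudaStream_t stream) {
    const std::vector<int64_t> kv_shape{1, 32, 128, 64};  // hypothetical [batch, heads, seq, head_dim]
    size_t kv_elems = 1;
    for (int64_t d : kv_shape) kv_elems *= static_cast<size_t>(d);

    // Device allocation done directly through the CUDA Toolkit.
    float* past_device = nullptr;
    cudaMalloc(&past_device, kv_elems * sizeof(float));

    // Tell onnxruntime this buffer lives in CUDA memory on device 0.
    Ort::MemoryInfo cuda_mem_info("Cuda", OrtArenaAllocator, /*device_id=*/0, OrtMemTypeDefault);
    Ort::Value past_tensor = Ort::Value::CreateTensor<float>(
        cuda_mem_info, past_device, kv_elems, kv_shape.data(), kv_shape.size());

    Ort::IoBinding binding(session);
    binding.BindInput("past_key_values", past_tensor);        // placeholder input name
    binding.BindOutput("present_key_values", cuda_mem_info);  // output allocated on the Device

    session.Run(Ort::RunOptions{nullptr}, binding);

    // The output KV-cache is already a Device pointer; copy it back into the
    // input buffer (here without beam reordering, for brevity) without ever
    // touching Host memory.
    std::vector<Ort::Value> outputs = binding.GetOutputValues();
    float* present_device = outputs.front().GetTensorMutableData<float>();
    cudaMemcpyAsync(past_device, present_device, kv_elems * sizeof(float),
                    cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFree(past_device);
}
```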
Am I correct in my understanding of (1) the root cause and (2) the inability to reach maximum performance by building solely against onnxruntime + CUDA?
To reproduce
Nothing to reproduce; this is an analysis.
Urgency
No response
Platform
Linux
OS Version
Ubuntu 20
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
onnxruntime 1.18.0
ONNX Runtime API
C++
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
Yes