microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai

[Performance] the root cause of onnxruntime-genai's being much faster than onnxruntime? #22805

Open idruker-cerence opened 1 week ago

idruker-cerence commented 1 week ago

Describe the issue

CUDA executes work on the Device (that is, the GPU). For efficiency, it is best to keep data and operations on the Device whenever possible and avoid round trips through the Host. The CUDA API provides explicit copies between memory spaces (Host to Device, Device to Host, and Device to Device) via cudaMemcpy() and its variants. This function is widely used within onnxruntime, and even more so in onnxruntime-genai.
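
To illustrate, here is a minimal sketch of these copy primitives using the plain CUDA runtime API (not onnxruntime code):

```cpp
#include <cuda_runtime.h>

// The copy direction is selected by the cudaMemcpyKind argument; the Async
// variant enqueues the copy on a stream so it can overlap with other device
// work instead of blocking the host.
int main() {
    const size_t bytes = 1024 * sizeof(float);
    float host_buf[1024] = {};
    float *dev_src = nullptr, *dev_dst = nullptr;
    cudaMalloc(&dev_src, bytes);
    cudaMalloc(&dev_dst, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Host -> Device (blocks the host until the copy completes)
    cudaMemcpy(dev_src, host_buf, bytes, cudaMemcpyHostToDevice);

    // Device -> Device, asynchronous: the data never leaves GPU memory
    cudaMemcpyAsync(dev_dst, dev_src, bytes, cudaMemcpyDeviceToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(dev_src);
    cudaFree(dev_dst);
    return 0;
}
```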

The performance bottleneck of generative LLM models is the huge KV-cache. When using onnxruntime, even with CUDA, the output KV-cache has to be copied to the Host, reordered according to the beam indices, and then fed back to the LLM model as input for generating the next token. In contrast, onnxruntime-genai has a special function that performs all of those operations in Device memory, using cudaMemcpyAsync with device-to-device copies. This, to my understanding, is the root cause of onnxruntime-genai's much better performance over onnxruntime.
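
For illustration, here is a hypothetical sketch of such a device-side beam reorder; the function name, buffer layout, and signature are my own assumptions, not onnxruntime-genai's actual code:

```cpp
#include <cuda_runtime.h>
#include <vector>

// Hypothetical sketch: reorder a KV-cache by beam indices entirely in device
// memory. Layout assumption: the cache is [num_beams, beam_stride] floats in
// a contiguous device buffer, with a separate destination buffer. The point
// is that device-to-device cudaMemcpyAsync avoids the round trip to the Host.
void ReorderKVCacheOnDevice(float* dev_dst, const float* dev_src,
                            const std::vector<int>& beam_indices,
                            size_t beam_stride, cudaStream_t stream) {
    for (size_t beam = 0; beam < beam_indices.size(); ++beam) {
        cudaMemcpyAsync(dev_dst + beam * beam_stride,
                        dev_src + beam_indices[beam] * beam_stride,
                        beam_stride * sizeof(float),
                        cudaMemcpyDeviceToDevice, stream);
    }
    // No DeviceToHost / HostToDevice pair: the cache never leaves the GPU
    // between decoding steps.
}
```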

However, while the CUDA Toolkit is available to onnxruntime itself when it is compiled with the "--use_cuda" flag, an application built against that onnxruntime does not have access to cudaMemcpy, because the onnxruntime API does not expose it. The only way for the application to achieve performance comparable to onnxruntime-genai's is to compile it against both onnxruntime with CUDA and the CUDA Toolkit directly (see the sketch below).
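
For illustration, a sketch of this "link against both" approach using the onnxruntime C++ API (IoBinding) together with the CUDA runtime; the tensor names and shapes are placeholders, not a real model's:

```cpp
#include <onnxruntime_cxx_api.h>
#include <cuda_runtime.h>

// Sketch: the application allocates the KV-cache with cudaMalloc, wraps the
// device pointer in an Ort::Value, and binds it via IoBinding so tensors stay
// on the GPU between Run() calls. "past_key_values" / "present_key_values"
// are placeholder names.
void BindDeviceKVCache(Ort::Session& session, float* dev_cache,
                       const int64_t* shape, size_t shape_len,
                       size_t element_count) {
    Ort::MemoryInfo cuda_info("Cuda", OrtArenaAllocator, /*device_id=*/0,
                              OrtMemTypeDefault);
    Ort::Value cache_tensor = Ort::Value::CreateTensor<float>(
        cuda_info, dev_cache, element_count, shape, shape_len);

    Ort::IoBinding binding(session);
    binding.BindInput("past_key_values", cache_tensor);   // placeholder name
    binding.BindOutput("present_key_values", cuda_info);  // output on device

    session.Run(Ort::RunOptions{}, binding);
    // Between steps the application can call cudaMemcpyAsync directly on
    // dev_cache to reorder beams, since it links the CUDA Toolkit itself.
}
```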

Am I correct in my understanding of (1) the root cause and (2) the inability to reach maximum performance by building solely against onnxruntime + CUDA?

To reproduce

Nothing to reproduce; this is an analysis.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 20

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

onnxruntime 1.18.0

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

Yes