Open · tianleiwu opened this issue 1 year ago
I was thinking we already had an `update_in_place` that took an `OrtValue`, but apparently not. Another idea is to add an `OrtValue` creation interface that takes a buffer pointer, shape, element type, and device info, plus an `update_in_place` that takes an `OrtValue`. With `OrtValue.update_inplace_from_buffer(source_ptr, bytes)`, we would need to pass in the source device info (metadata that is already available in an `OrtValue` instance), and it seems incomplete to not have an interface that supports creating an `OrtValue` from a raw buffer in the first place (something the C/C++ API already does).
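As a rough illustration of the shape such an interface could take, here is a hypothetical sketch: `ortvalue_from_buffer` and `update_in_place` do not exist in the Python API today, and their names and signatures are assumptions for discussion only; the torch tensor just stands in for an arbitrary device buffer.

```python
import numpy as np
import torch
import onnxruntime as ort

# A GPU buffer owned by another framework, standing in for "a raw buffer".
src_tensor = torch.ones(2, 3, dtype=torch.float32, device="cuda")

# Hypothetical factory (does not exist today): wrap an existing device buffer
# in an OrtValue, mirroring what the C API's CreateTensorWithDataAsOrtValue allows.
src = ort.OrtValue.ortvalue_from_buffer(
    src_tensor.data_ptr(), [2, 3], np.float32, "cuda", 0
)

# Existing API: a destination OrtValue with a fixed device address.
dst = ort.OrtValue.ortvalue_from_shape_and_type([2, 3], np.float32, "cuda", 0)

# Hypothetical in-place update that takes an OrtValue, so the source device
# info travels with the source and does not need to be passed separately.
dst.update_in_place(src)
```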
Describe the feature request
For CUDA Graph, the graph inputs must live at fixed memory addresses.
Currently, there is a Python API, `OrtValue.update_inplace(np_arr)`, which accepts a numpy ndarray as the source. That means the source must be on CPU. When the source data (like an encoder output) is already on GPU, we have to use an external API to copy memory from device to device, which is not convenient.
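For illustration, a minimal sketch of the existing path, assuming onnxruntime-gpu is installed; the key point is that `update_inplace` only takes a host-side numpy array:

```python
import numpy as np
import onnxruntime as ort

# Allocate a CUDA OrtValue; its device address stays fixed, which is
# what CUDA Graph capture/replay needs for graph inputs.
gpu_input = ort.OrtValue.ortvalue_from_shape_and_type([2, 3], np.float32, "cuda", 0)

# The existing in-place update only accepts a numpy array, i.e. a CPU source.
# A GPU-resident source (e.g. an encoder output) cannot be passed here directly.
gpu_input.update_inplace(np.ones((2, 3), dtype=np.float32))
```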
There are two ways to improve that:
(1) Add a Python API like `OrtValue.update_inplace_from_buffer(source_ptr, bytes)` (see the sketch after this list).
(2) Do the memory copy internally in ONNX Runtime: when users bind an input that lives on device and CUDA Graph is enabled, ORT would copy the input to the fixed address before launching the CUDA graph.
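A sketch of how option (1) could look from user code; `update_inplace_from_buffer` is the proposed API and does not exist yet, and the torch tensor is only a stand-in for any GPU-resident source buffer:

```python
import numpy as np
import torch
import onnxruntime as ort

# Fixed-address CUDA OrtValue used as a CUDA Graph input (existing API).
gpu_input = ort.OrtValue.ortvalue_from_shape_and_type([2, 3], np.float32, "cuda", 0)

# GPU-resident source, e.g. an encoder output produced by another model.
encoder_output = torch.ones(2, 3, dtype=torch.float32, device="cuda")

# Proposed (hypothetical) API: copy device-to-device from a raw pointer into
# the OrtValue's fixed buffer, without going through host memory.
gpu_input.update_inplace_from_buffer(
    encoder_output.data_ptr(),                                   # source device pointer
    encoder_output.element_size() * encoder_output.nelement(),   # byte count
)
```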
Describe scenario use case
In Stable Diffusion, there are multiple models. When we use CUDA Graph, the inputs for these models are on GPU, and we need to copy the inputs to the same fixed memory addresses to launch the CUDA graph.
The current solution is to install cuda-python and use the cudaMemcpy API. However, that requires installing a large library. It would be better if this were supported internally in ORT.
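For reference, a minimal sketch of the current workaround described above, assuming onnxruntime-gpu, torch, and cuda-python are installed; the shapes and the torch source tensor are illustrative:

```python
import numpy as np
import torch
from cuda import cudart
import onnxruntime as ort

# Fixed-address CUDA OrtValue that will be (re)used as a CUDA Graph input.
gpu_input = ort.OrtValue.ortvalue_from_shape_and_type([2, 3], np.float32, "cuda", 0)

# GPU-resident source data, e.g. the output of a previous model.
encoder_output = torch.ones(2, 3, dtype=torch.float32, device="cuda")

# Device-to-device copy into the OrtValue's fixed buffer via cuda-python.
# This extra dependency is what the issue would like to avoid.
err, = cudart.cudaMemcpy(
    gpu_input.data_ptr(),                                        # destination
    encoder_output.data_ptr(),                                   # source
    encoder_output.element_size() * encoder_output.nelement(),   # byte count
    cudart.cudaMemcpyKind.cudaMemcpyDeviceToDevice,
)
assert err == cudart.cudaError_t.cudaSuccess
```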