microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] How to create multiple tensors with consecutive addresses when the cuda memory is not occupied? #14742

Open baoachun opened 1 year ago

baoachun commented 1 year ago

Describe the issue

My model has about 2500 input nodes. If a separate cudaMemcpy is performed for each input, the total copy time is about 150 ms. I'm therefore wondering whether it is possible to create input tensors with contiguous addresses so that a single cudaMemcpy is enough. An alternative is to allocate a sufficiently large block of CUDA memory first, perform one H2D copy of all the real input data into it, and then perform D2D copies from that buffer into the data addresses of ONNX Runtime's input tensors. However, this method still requires a large number of D2D copies and uses more memory.
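
A rough, untested sketch of the staging approach described above (my own illustration, not from the ONNX Runtime docs): one packed H2D copy, then one D2D copy per input tensor. It assumes all inputs are float32, uses cupy only for the raw CUDA copies, and the input names/shapes are placeholders.

import numpy as np
import cupy as cp
import onnxruntime as ort

# Placeholder host inputs; the real model has ~2500 of them with real shapes.
host_inputs = {f"input_{i}": np.zeros((4,), dtype=np.float32) for i in range(3)}

# One H2D copy: pack everything on the host and push it to the device.
packed = np.concatenate([a.ravel() for a in host_inputs.values()])
staging = cp.asarray(packed)

# Pre-allocate a CUDA OrtValue per input (these would be fed to the session).
ort_inputs = {name: ort.OrtValue.ortvalue_from_shape_and_type(list(a.shape),
                                                              np.float32, 'cuda', 0)
              for name, a in host_inputs.items()}

# Scatter from the staging buffer into each tensor: still one D2D copy per
# input, which is exactly the overhead described above.
offset = 0
for name, a in host_inputs.items():
    cp.cuda.runtime.memcpy(ort_inputs[name].data_ptr(),
                           staging.data.ptr + offset,
                           a.nbytes,
                           cp.cuda.runtime.memcpyDeviceToDevice)
    offset += a.nbytes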

To reproduce

Same as the issue description above.

Urgency

No response

Platform

Linux

OS Version

CentOS 7

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

cuda 11.4, A30

Model File

No response

Is this a quantized model?

No

tianleiwu commented 1 year ago

You can try IO Binding to specify the memory buffer of each input. That way you can use consecutive addresses; you only need to calculate the offset and size of each input.

See the API documentation (https://onnxruntime.ai/docs/api/python/api_summary.html). There is an example like the following:

io_binding = session.io_binding()
io_binding.bind_input(name='input', device_type=X_ortvalue.device_name(), device_id=0,
                      element_type=np.float32, shape=X_ortvalue.shape(),
                      buffer_ptr=X_ortvalue.data_ptr())
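
Building on that example, here is a rough sketch (my own, not from the docs) of binding every input to an offset inside one contiguous device buffer, so only a single H2D cudaMemcpy is needed. It assumes all inputs are float32, uses cupy just for the device allocation, and the model path and placeholder shapes are hypothetical.

import numpy as np
import cupy as cp
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

# Placeholder host data; real shapes/dtypes come from the actual model
# (symbolic dims are replaced by 1 here just to keep the sketch runnable).
host_inputs = {inp.name: np.zeros([d if isinstance(d, int) else 1 for d in inp.shape],
                                  dtype=np.float32)
               for inp in session.get_inputs()}

# Compute each input's byte offset in one packed, back-to-back layout.
offsets, total_bytes = {}, 0
for name, arr in host_inputs.items():
    offsets[name] = total_bytes
    total_bytes += arr.nbytes

# Single H2D copy: pack on the host, then move the whole block to the GPU.
packed_host = np.concatenate([a.ravel() for a in host_inputs.values()])
packed_dev = cp.asarray(packed_host)

# Bind every input to its slice of the contiguous device buffer.
io_binding = session.io_binding()
for name, arr in host_inputs.items():
    io_binding.bind_input(name=name, device_type='cuda', device_id=0,
                          element_type=np.float32, shape=arr.shape,
                          buffer_ptr=packed_dev.data.ptr + offsets[name])
for out in session.get_outputs():
    io_binding.bind_output(out.name, device_type='cuda', device_id=0)

session.run_with_iobinding(io_binding)

The offsets are plain cumulative byte counts, so for float32 inputs every bound pointer stays 4-byte aligned; if the real model mixes element types, the layout would need per-type padding.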