microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ORTValue to Pytorch CUDA Tensor Interface #10286

Closed. ManuelAngel99 closed this issue 2 years ago.

ManuelAngel99 commented 2 years ago

Is your feature request related to a problem? Please describe. I'm implementing a T5 model in ONNX Runtime with the intention of speeding up GPU inference. In order to avoid copying the decoder outputs back and forth between the GPU and the CPU, I'm using ONNX Runtime IO binding, which makes it easy to use PyTorch tensors as model inputs via the tensor's data_ptr() method. Since the dimensions of the input are known before running the model, there is no major issue supplying the input shape to the input binder.

The same procedure can be applied to the output bindings; however, the shape of each output needs to be precalculated and a torch tensor needs to be created to store the result. This can be rather cumbersome when dealing with a large number of outputs with different dimensions (like the past key values of the T5 decoder).

This could be avoided if there were a simple method to create a torch tensor on the GPU from an OrtValue located on the GPU, without needing to transfer the data to the CPU, which causes a considerable latency increase.

Describe the solution you'd like I would like to be able to create a torch tensor on the GPU directly from an OrtValue located on the GPU.

For example:

# Dummy OrtValue on the GPU (X is a NumPy array)
ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
# Proposed method: convert directly to a torch CUDA tensor
torch_tensor = ortvalue.torch_tensor()
ManuelAngel99 commented 2 years ago

After doing some research I have found a plausible workaround using dlpack, as mentioned in #4162; however, it seems that the to_dlpack() method is available neither in onnxruntime-gpu==1.10.0 nor in onnxruntime-training.
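
For reference, a minimal sketch of what the dlpack route would look like, assuming a build where OrtValue exposes to_dlpack() (which, as noted above, does not appear to be the case for onnxruntime-gpu 1.10.0 or onnxruntime-training):

import numpy as np
import onnxruntime
from torch.utils.dlpack import from_dlpack

# Assumption: this build's OrtValue exposes to_dlpack()
X = np.random.rand(3, 4).astype(np.float32)
ortvalue = onnxruntime.OrtValue.ortvalue_from_numpy(X, 'cuda', 0)
torch_tensor = from_dlpack(ortvalue.to_dlpack())  # zero-copy handoff via the DLPack protocol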

ManuelAngel99 commented 2 years ago

I finally got it working. If someone is facing the same issue, here are the steps I followed:

  1. Since the standard onnxruntime package distributed on PyPI doesn't include the onnxruntime training module, and the onnxruntime-training package doesn't provide CUDA support, I needed to download and compile onnxruntime from source using the --use_cuda and --enable_training flags. Here is the exact command I used:
    bash ./onnxruntime/build.sh --config Release --build_wheel --parallel --use_openmp --use_cuda --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda --cmake_extra_defines --skip_tests --enable_training --use_tensorrt --tensorrt_home /opt/tensorrt/
  2. Import the following function from onnxruntime.training
    from onnxruntime.training.ortmodule._utils import _ortvalue_to_torch_tensor
  3. Convert to PyTorch:
    
    from onnxruntime import OrtValue
    import torch
    import numpy as np
    # Create a sample OrtValue on the GPU
    x = OrtValue.ortvalue_from_numpy(np.random.rand(3), 'cuda')

    # Convert to torch
    device = torch.device('cuda')
    a = _ortvalue_to_torch_tensor(x._ortvalue, device)
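
As a quick sanity check (purely illustrative), the resulting tensor should live on the GPU and keep the shape of the original OrtValue:

# Illustrative check: the converted tensor is on CUDA with the same shape
assert a.device.type == 'cuda'
assert tuple(a.shape) == tuple(x.shape())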

jeshels commented 2 years ago

I would like to add that, due to #9467, I had to add the --cuda_version flag to the build flags. For example:

bash ./onnxruntime/build.sh --config Release --build_wheel --parallel --use_openmp --use_cuda --cudnn_home /usr/local/cuda --cuda_home /usr/local/cuda --cmake_extra_defines --skip_tests --enable_training --use_tensorrt --tensorrt_home /opt/tensorrt/ --cuda_version=10.1

After this, everything worked as in the solution mentioned above.

pommedeterresautee commented 2 years ago

Posted in another issue too, but just in case someone searches for the same issue:

A much simpler approach (no need to recompile ORT): just provide the pointer of a torch tensor as the output buffer of your model:

binding = session.io_binding()
...
# test export to Torch tensor directly
# https://onnxruntime.ai/docs/api/python/api_summary.html#iobinding
logit_output = torch.empty((2, 128, 50257), dtype=torch.float32, device='cuda')
binding.bind_output(name=onnx_named_outputs[0], device_type='cuda', device_id=0, element_type=np.float32, shape=tuple(logit_output.shape), buffer_ptr=logit_output.data_ptr())

Then your tensor is filled with the right values, with no copy to the CPU. You can also pass a tensor's pointer as the buffer of an input in the same way.
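
Along the same lines, a hedged sketch of binding a torch CUDA tensor as a model input via its data_ptr(); the input name list and shape here are illustrative, not taken from a real model:

# Bind a torch CUDA tensor as a model input via its pointer (names/shapes illustrative)
input_ids = torch.zeros((2, 128), dtype=torch.int64, device='cuda')
binding.bind_input(name=onnx_named_inputs[0], device_type='cuda', device_id=0, element_type=np.int64, shape=tuple(input_ids.shape), buffer_ptr=input_ids.data_ptr())
session.run_with_iobinding(binding)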

ManuelAngel99 commented 2 years ago

Hi, as mentioned in the OP, I was looking for a way to convert the OrtValue to a torch tensor without needing to create the torch tensors (and compute their dimensions) beforehand. If your model only has one output (or a few) and its size is easy to calculate, it is more convenient to use the IOBinding interface (as mentioned in the OP).