microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ORTModelForSeq2SeqLM.from_pretrained can not use provider=['CUDAExecutionProvider','CPUExecutionProvider'] #21733

Open EASTERNTIGER opened 3 months ago

EASTERNTIGER commented 3 months ago

Describe the issue

Hi, when I use the code model = ORTModelForSeq2SeqLM.from_pretrained(model_path, provider='CUDAExecutionProvider'), a warning appears (see the attached screenshot) that has a large negative effect on my inference speed. When I change provider='CUDAExecutionProvider' to provider=['CUDAExecutionProvider','CPUExecutionProvider'], it raises an error (see the attached screenshot). How can I fix this?

To reproduce

model = ORTModelForSeq2SeqLM.from_pretrained(model_path,provider=['CUDAExecutionProvider','CPUExecutionProvider'])

Urgency

No response

Platform

Linux

OS Version

other

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA12.4

tianleiwu commented 3 months ago

optimum assumes that the provider is a string, not a list of strings. I suggest setting provider='CUDAExecutionProvider' for optimum.

In onnxruntime, some nodes might fall back to CPU, as shown in the warnings. You can turn on verbose logging to see which nodes are placed on CPU.
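
A minimal sketch of the suggestion above. The set_default_logger_severity call is onnxruntime's Python API for the default logger, and the provider keyword is the one optimum already uses in this issue; the model directory path is a hypothetical placeholder.

```python
import onnxruntime
from optimum.onnxruntime import ORTModelForSeq2SeqLM

# 0 = VERBOSE: ORT logs which execution provider each node is assigned to.
onnxruntime.set_default_logger_severity(0)

model_path = "./onnx_model_dir"  # hypothetical path; use your exported model directory

# optimum expects a single provider name (a string), not a list;
# ORT itself still registers the CPU provider as a fallback.
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_path,
    provider="CUDAExecutionProvider",
)
```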

EASTERNTIGER commented 3 months ago

> optimum assumes that the provider is a string, not a list of strings. I suggest setting provider='CUDAExecutionProvider' for optimum.
>
> In onnxruntime, some nodes might fall back to CPU, as shown in the warnings. You can turn on verbose logging to see which nodes are placed on CPU.

Yeah, I set provider='CUDAExecutionProvider', and then it shows the warning: "Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf." I am sure that this fallback makes my inference slower.

tianleiwu commented 3 months ago

You can disable the warnings if needed by setting the session options property log_severity_level to 3 or 4.
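
For example, a sketch along these lines; that optimum's from_pretrained forwards a session_options object to the underlying InferenceSession is an assumption worth verifying for your optimum version, and the model path is a placeholder.

```python
import onnxruntime
from optimum.onnxruntime import ORTModelForSeq2SeqLM

sess_options = onnxruntime.SessionOptions()
sess_options.log_severity_level = 3  # 0=VERBOSE, 1=INFO, 2=WARNING, 3=ERROR, 4=FATAL

model = ORTModelForSeq2SeqLM.from_pretrained(
    "./onnx_model_dir",              # hypothetical path to the exported model
    provider="CUDAExecutionProvider",
    session_options=sess_options,    # assumed to be passed through to the ORT session
)
```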

If you want to improve performance, you will need to optimize the model and the KV cache buffers (share the buffers for past and present key/values using I/O binding). As an example, you can optimize T5 using the convert_generation tool like:

python -m onnxruntime.transformers.convert_generation -m t5-small --model_type t5 --output ./models/t5/onnx_models/t5_small_beam_search.onnx --use_gpu --past_present_share_buffer --use_decoder_masked_attention
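
If the export succeeds, the resulting graph embeds the whole beam search, so it can be run directly with an onnxruntime InferenceSession. Below is a sketch assuming the input names convert_generation typically produces for T5 (input_ids, max_length, min_length, num_beams, num_return_sequences, length_penalty, repetition_penalty); confirm them with session.get_inputs() on your exported model.

```python
import numpy as np
import onnxruntime
from transformers import AutoTokenizer

# Load the exported beam-search model on GPU, with CPU as fallback.
session = onnxruntime.InferenceSession(
    "./models/t5/onnx_models/t5_small_beam_search.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
input_ids = tokenizer("translate English to French: Hello world", return_tensors="np").input_ids

# Input names below are the ones the tool usually exports; verify with session.get_inputs().
outputs = session.run(None, {
    "input_ids": input_ids.astype(np.int32),
    "max_length": np.array([64], dtype=np.int32),
    "min_length": np.array([1], dtype=np.int32),
    "num_beams": np.array([4], dtype=np.int32),
    "num_return_sequences": np.array([1], dtype=np.int32),
    "length_penalty": np.array([1.0], dtype=np.float32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
})

sequences = outputs[0]  # expected shape: (batch_size, num_return_sequences, max_length)
print(tokenizer.decode(sequences[0][0], skip_special_tokens=True))
```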

EASTERNTIGER commented 3 months ago

> You can disable the warnings if needed by setting the session options property log_severity_level to 3 or 4.
>
> If you want to improve performance, you will need to optimize the model and the KV cache buffers (share the buffers for past and present key/values using I/O binding). As an example, you can optimize T5 using the convert_generation tool like:
>
> python -m onnxruntime.transformers.convert_generation -m t5-small --model_type t5 --output ./models/t5/onnx_models/t5_small_beam_search.onnx --use_gpu --past_present_share_buffer --use_decoder_masked_attention

Thank you so much for your reply! When I run the command you showed me, it works well at first, but then there is a TypeError (see the attached screenshot). It seems that the ONNX conversion is successful, but the test step fails.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.