Open · EASTERNTIGER opened 3 months ago
optimum assumes that the provider is a string, not a list of strings. I suggest setting provider='CUDAExecutionProvider' for optimum.
In onnxruntime, some nodes might fall back to CPU, as shown in the warnings. You can turn on verbose logging to see which nodes are placed on CPU.
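For example, a minimal sketch of both suggestions (model_path is a placeholder for your exported model directory; session_options is the hook recent optimum versions expose for passing ONNX Runtime session options):

```python
import onnxruntime
from optimum.onnxruntime import ORTModelForSeq2SeqLM

model_path = "./onnx_model"  # placeholder: path to your exported ONNX model

# Verbose ORT logging (severity 0) prints, among other things,
# which execution provider each node was assigned to.
session_options = onnxruntime.SessionOptions()
session_options.log_severity_level = 0  # 0 = VERBOSE, 1 = INFO, 2 = WARNING (default)

# Pass the provider as a single string, not a list.
model = ORTModelForSeq2SeqLM.from_pretrained(
    model_path,
    provider="CUDAExecutionProvider",
    session_options=session_options,
)
```

With verbose logging enabled, search the session initialization output for nodes assigned to CPUExecutionProvider to see exactly what fell back.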
Yeah, I set provider='CUDAExecutionProvider', and then it shows this warning: "Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf." I am fairly sure that whatever this warning refers to is making my inference slower.
You can disable warnings in logging if needed by setting the session options property log_severity_level to 3 or 4.
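A minimal sketch with plain onnxruntime (the model path is a placeholder); with optimum, the same SessionOptions object can be passed via session_options as in the earlier sketch:

```python
import onnxruntime as ort

# 3 = ERROR hides the "Some nodes were not assigned ..." warnings; 4 = FATAL hides errors too.
so = ort.SessionOptions()
so.log_severity_level = 3

sess = ort.InferenceSession(
    "model.onnx",  # placeholder path to your exported model
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
```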
If you want to improve performance, you will need to optimize the model and the KV cache buffers (use shared buffers for past and present via I/O binding). As an example, you can optimize T5 using the convert_generation tool like:
python -m onnxruntime.transformers.convert_generation -m t5-small --model_type t5 --output ./models/t5/onnx_models/t5_small_beam_search.onnx --use_gpu --past_present_share_buffer --use_decoder_masked_attention
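Once converted, the model can be run as a single session, for example with a sketch like the one below. The input and output names (input_ids, max_length, num_beams, ..., sequences) are the ones convert_generation typically emits for the fused beam-search graph; verify them against your exported model if they differ.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
sess = ort.InferenceSession(
    "./models/t5/onnx_models/t5_small_beam_search.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

encoded = tokenizer("translate English to French: Hello, how are you?", return_tensors="np")
inputs = {
    "input_ids": encoded["input_ids"].astype(np.int32),
    "max_length": np.array([64], dtype=np.int32),
    "min_length": np.array([1], dtype=np.int32),
    "num_beams": np.array([4], dtype=np.int32),
    "num_return_sequences": np.array([1], dtype=np.int32),
    "length_penalty": np.array([1.0], dtype=np.float32),
    "repetition_penalty": np.array([1.0], dtype=np.float32),
}

# "sequences" has shape [batch, num_return_sequences, max_length]
sequences = sess.run(["sequences"], inputs)[0]
print(tokenizer.decode(sequences[0][0], skip_special_tokens=True))
```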
Thank you so much for your reply!! When I try to run the command you showed me, it works well at first, but then there is a TypeError:
It seems that the ONNX conversion is successful, but the test fails.
This issue has been automatically marked as stale due to inactivity and will be closed in 30 days if no further activity occurs. If further support is needed, please provide an update and/or more details.
Describe the issue
Hi, when I use the code model = ORTModelForSeq2SeqLM.from_pretrained(model_path, provider='CUDAExecutionProvider'), it shows a warning that has a big impact on my inference speed. When I change provider='CUDAExecutionProvider' to provider=['CUDAExecutionProvider','CPUExecutionProvider'], it shows an error instead. So how can I fix that?
To reproduce
model = ORTModelForSeq2SeqLM.from_pretrained(model_path,provider=['CUDAExecutionProvider','CPUExecutionProvider'])
Urgency
No response
Platform
Linux
OS Version
other
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
1.18.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA12.4