skyline75489 opened this issue 2 years ago
@skyline75489, thanks for the feedback. I can reproduce the issue. We'll update the benchmark to measure only the encoder of BART.

For text generation, it is better to run an end-to-end performance test. Decoding latency depends on multiple factors: batch size, context sequence length, number of generated tokens, beam search vs. greedy search vs. beam sampling, early stopping, etc. If you need to account for all of these factors, I suggest modifying the following script: https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/transformers/models/bart

The script exports the model to ONNX and runs some example text generation; only a small change is needed to measure latency.
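For reference, here is a minimal sketch of what such a latency measurement could look like. It assumes the export produces a single ONNX model that embeds the whole decoding loop, so one `session.run()` call covers end-to-end generation. The model path, input names, vocab size, and shapes below are illustrative assumptions, not the linked script's actual interface; match them to whatever your export actually produces.

```python
# Hedged sketch of an end-to-end generation latency harness.
# Assumes "model.onnx" bundles the decoding loop, so a single session.run()
# performs full generation. Input names/shapes are assumptions for illustration.
import time

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

def run_once(batch_size: int, context_len: int, max_len: int, num_beams: int) -> float:
    """Run one generation call and return its wall-clock latency in seconds."""
    inputs = {
        # Random token ids as a stand-in for real context; 30000 is a placeholder vocab size.
        "input_ids": np.random.randint(0, 30000, (batch_size, context_len), dtype=np.int32),
        "max_length": np.array([max_len], dtype=np.int32),
        "min_length": np.array([1], dtype=np.int32),
        "num_beams": np.array([num_beams], dtype=np.int32),
        "num_return_sequences": np.array([1], dtype=np.int32),
        "length_penalty": np.array([1.0], dtype=np.float32),
        "repetition_penalty": np.array([1.0], dtype=np.float32),
    }
    start = time.perf_counter()
    session.run(None, inputs)
    return time.perf_counter() - start

# Sweep the factors that dominate decoding latency: batch size and search strategy.
for batch_size in (1, 4):
    for num_beams in (1, 4):  # 1 = greedy search, >1 = beam search
        run_once(1, 32, 8, num_beams)  # warmup to exclude one-time initialization cost
        times = [run_once(batch_size, 128, 64, num_beams) for _ in range(10)]
        print(f"batch={batch_size} beams={num_beams} "
              f"mean={np.mean(times) * 1e3:.1f}ms p90={np.percentile(times, 90) * 1e3:.1f}ms")
```

Varying context length and `max_length` in the same sweep would show the remaining factors mentioned above; the key point is to time the full generation call rather than a single encoder pass.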
Describe the issue
To reproduce
Urgency: No response
Platform: Linux
OS Version: Ubuntu 20.04
ONNX Runtime Installation: Released Package
ONNX Runtime Version or Commit ID: 1.12.1
ONNX Runtime API: Python
Architecture: X64
Execution Provider: CUDA
Execution Provider Library Version: CUDA 11.6