Open feng-1985 opened 1 year ago
For a BERT model, try optimizing it like the following:
pip install onnxruntime-gpu==1.15
python -m onnxruntime.transformers.optimizer --input bert.onnx --output bert_fp16.onnx --float16 --use_gpu
If everything goes well, it will use a fused attention kernel (such as flash attention), which can save memory for long sequences.
Note that 1.6 does not have fused attention, so you will need to upgrade onnxruntime-gpu to the latest version.
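For reference, the same optimization can also be done from Python. Below is a minimal sketch of the Python API (optimize_model from onnxruntime.transformers), assuming a recent onnxruntime-gpu release; the num_heads/hidden_size values assume BERT-base, so adjust them for your model:

from onnxruntime.transformers import optimizer

# Fuse attention/layer-norm subgraphs into optimized ops.
# num_heads/hidden_size assume BERT-base; change for other models.
opt_model = optimizer.optimize_model(
    "bert.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
    use_gpu=True,
)
opt_model.convert_float_to_float16()           # same effect as --float16
opt_model.save_model_to_file("bert_fp16.onnx")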
Thanks for the response. In the production environment, only CUDA 10.2 is available, so I use onnxruntime-gpu==1.6. Another related question
@feng-1985, run
python -m onnxruntime.transformers.optimizer --help
to see the usage. The tool applies graph optimizations to convert the model graph into a new one. You can try adding --use_mask_index, which is not the default in onnxruntime-gpu 1.6.
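If you want to confirm that the fusion actually happened, one quick check is to count the operator types in the optimized graph. A minimal sketch using the onnx package; the file name is a placeholder:

import onnx
from collections import Counter

model = onnx.load("bert_fp16.onnx")   # placeholder path
# A successfully fused graph contains ops like Attention in place of
# the original MatMul/Add/Softmax subgraphs.
print(Counter(node.op_type for node in model.graph.node))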
Describe the issue
CUDA 10.2, onnx==1.8, onnxruntime-gpu==1.6
For a sequence labeling task (input: token ids; output: start_pos, end_pos), PyTorch uses 1.8 GB but ONNX uses 1.9 GB (although ONNX inference is faster). --- torch 1.10, BERT-base fine-tuning
For a text classification task, PyTorch uses 2.2 GB while ONNX uses just 0.8 GB. --- torch 1.9.0, RoBERTa-base fine-tuning
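One way to measure the session's GPU memory is to read device usage from NVML before and after creating the InferenceSession. A minimal sketch, assuming the pynvml package; the model path and device index are placeholders, and note that some memory is only allocated on the first Run call:

import onnxruntime as ort
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def used_mib():
    # Device-wide used memory in MiB, as reported by NVML.
    return pynvml.nvmlDeviceGetMemoryInfo(handle).used / 1024**2

before = used_mib()
sess = ort.InferenceSession("bert.onnx", providers=["CUDAExecutionProvider"])
print(f"GPU memory after session creation: {used_mib() - before:.0f} MiB")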
To reproduce
I used this script and the sequence labeling datasets, running just five epochs. Then I converted the torch model to an ONNX model.
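The conversion step looks roughly like this (a minimal sketch; BertForQuestionAnswering stands in for the actual fine-tuned model since it also produces start/end logits, and the checkpoint path and sequence length are placeholders):

import torch
from transformers import BertForQuestionAnswering  # stand-in for the fine-tuned model

model = BertForQuestionAnswering.from_pretrained("./finetuned")  # placeholder path
model.config.return_dict = False   # export plain tuple outputs instead of a ModelOutput
model.eval()

# Dummy inputs only fix the trace shapes; real shapes stay dynamic via dynamic_axes.
input_ids = torch.ones(1, 128, dtype=torch.int64)
attention_mask = torch.ones(1, 128, dtype=torch.int64)

torch.onnx.export(
    model,
    (input_ids, attention_mask),
    "bert.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["start_logits", "end_logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
    },
    opset_version=11,  # within the opset range supported by onnxruntime 1.6
)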
Urgency
No response
Platform
Linux
OS Version
Ubuntu 18
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.6
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response