microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ONNX uses more memory than PyTorch for some models #16264

Open feng-1985 opened 1 year ago

feng-1985 commented 1 year ago

Describe the issue

cuda 10.2 onnx=1.8 onnxruntime-gpu=1.6

For a sequence labeling task (input: token IDs; output: start_pos, end_pos), PyTorch uses 1.8 GB but ONNX uses 1.9 GB, although ONNX inference is faster (torch 1.10, BERT-base fine-tuning). For a text classification task, PyTorch uses 2.2 GB while ONNX uses just 0.8 GB (torch 1.9.0, roberta_base fine-tuning).

To reproduce

I used this script and datasets for sequence labeling, training for just five epochs, then converted the torch model to an ONNX model.

Urgency

No response

Platform

Linux

OS Version

ubuntu 18

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.6

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

tianleiwu commented 1 year ago

For a BERT model, try optimizing it like the following:

pip install onnxruntime-gpu==1.15
python -m onnxruntime.transformers.optimizer --input bert.onnx --output bert_fp16.onnx --float16 --use_gpu

If everything goes well, it will use a fused attention kernel (like flash attention), which can save memory for long sequences.

Note that 1.6 does not have fused attention, so you will need to upgrade onnxruntime-gpu to the latest version.

feng-1985 commented 1 year ago

Thanks for the response. For the production environment, only CUDA 10.2 is available, so I use onnxruntime-gpu 1.6. Two related questions:

  1. If I convert the model to float16, does CUDA 10.2 support it?
  2. In `python -m onnxruntime.transformers.optimizer --input bert.onnx --output bert_fp16.onnx --float16 --use_gpu`, the default value of `use_gpu` is false; does using the parameter just rename the model and set the execution providers?
tianleiwu commented 1 year ago

@feng-1985,

  1. float16 is supported on CUDA 10.2 with onnxruntime-gpu 1.6.
  2. Run `python -m onnxruntime.transformers.optimizer --help` to see the usage. The tool applies graph optimizations to convert the model graph to a new one. You can also try adding `--use_mask_index`, which is not the default in onnxruntime-gpu 1.6.
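Putting the advice together, a possible invocation for the onnxruntime-gpu 1.6 / CUDA 10.2 setup, keeping fp32 weights but adding the mask-index flag mentioned above, might look like this (file names are placeholders):

```shell
# Optimize the exported BERT graph for GPU execution, keeping fp32,
# and pass the attention mask as an index tensor (not default in 1.6).
python -m onnxruntime.transformers.optimizer \
    --input bert.onnx \
    --output bert_opt.onnx \
    --use_gpu \
    --use_mask_index
```

Whether this reduces the memory gap versus PyTorch would need to be measured on the actual model.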