microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Feature Request] #20010

Open · inisis opened 5 months ago

inisis commented 5 months ago

Describe the feature request

Hi, onnx and onnxruntime are great. I have built a tool named onnxslim, which can help optimize ONNX models, especially large language models. Is there any chance this tool could be used in the onnxruntime repo? Thanks.

Describe scenario use case

pip install onnxslim
onnxslim raw.onnx slim.onnx
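
The same can also be done from Python (a minimal sketch, assuming onnxslim's `slim` entry point mirrors the CLI above):

```python
import onnx
import onnxslim  # pip install onnxslim

# Equivalent of `onnxslim raw.onnx slim.onnx`: load, optimize, save.
model = onnx.load("raw.onnx")
slimmed = onnxslim.slim(model)
onnx.save(slimmed, "slim.onnx")
```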

The example below shows how onnxslim can slim Qwen-1.8B from Alibaba:

[image: onnxslim optimization summary for Qwen-1.8B]

inisis commented 5 months ago

@tianleiwu Can you please review it?

tianleiwu commented 5 months ago

@inisis, thanks for creating a helpful tool for the ONNX community.

ONNX Runtime applies graph optimizations during session creation. They are implemented in C++, as listed in https://github.com/microsoft/onnxruntime/blob/06fe4f31131a6873a295ba47ed60f4cb16584296/orttraining/orttraining/core/optimizer/graph_transformer_utils.cc
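
For reference, those session-creation optimizations are controlled through SessionOptions in the Python API (standard onnxruntime usage; the file names are placeholders):

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Enable all graph transformers (basic, extended, and layout optimizations).
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
# Optionally dump the optimized graph to inspect what ORT rewrote.
so.optimized_model_filepath = "model.opt.onnx"
sess = ort.InferenceSession("model.onnx", so, providers=["CPUExecutionProvider"])
```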

Another is a Python-based offline optimization tool for transformers: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/optimizer.py It fuses certain subgraphs into custom operators like Attention/SkipLayerNorm/BiasGelu, and it can also convert an fp32 model to an fp16 mixed-precision model. It targets popular models like BERT/BART/T5/StableDiffusion. After fusion, only essential nodes are left in the ONNX graph, so I think onnxslim might not help much on those models.
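
Typical usage of that offline optimizer looks like this (a sketch for a BERT-style model; the file names and the head/hidden sizes are placeholders):

```python
from onnxruntime.transformers import optimizer

# Fuse Attention/SkipLayerNorm/BiasGelu subgraphs into ORT custom operators.
opt_model = optimizer.optimize_model(
    "bert_fp32.onnx",
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt_model.convert_float_to_float16()  # optional fp32 -> fp16 mixed precision
opt_model.save_model_to_file("bert_fp16.onnx")
```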

Related docs can be found here: https://onnxruntime.ai/docs/performance/model-optimizations/graph-optimizations.html and https://onnxruntime.ai/docs/performance/transformers-optimization.html

For LLMs, we have started using the torch dynamo exporter. Its fusion patterns could differ from those of the torchscript-based ONNX exporter.
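
To illustrate the two export paths on a toy module (a sketch against the PyTorch 2.x APIs; file names are placeholders):

```python
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.gelu(x @ x.mT)

model, x = Toy().eval(), torch.randn(4, 4)

# TorchScript-based exporter (legacy path).
torch.onnx.export(model, (x,), "toy_ts.onnx")

# Dynamo-based exporter; the emitted graph, and thus the fusion
# patterns that apply to it, can differ from the TorchScript path.
torch.onnx.dynamo_export(model, x).save("toy_dynamo.onnx")
```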

I took a quick look at onnxslim. Some fusion patterns might be added to the C++ optimizer, which would require porting some code from Python to C++.
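
To illustrate the kind of generic graph rewrite such a port would involve (a hypothetical Identity-elimination pass, not onnxslim's actual code):

```python
import onnx

def eliminate_identity(model: onnx.ModelProto) -> onnx.ModelProto:
    # Identity nodes whose output is not a graph output are removable.
    graph = model.graph
    graph_outputs = {o.name for o in graph.output}
    rewrite = {}  # removed-Identity output name -> its input name
    kept = []
    for node in graph.node:
        if node.op_type == "Identity" and node.output[0] not in graph_outputs:
            rewrite[node.output[0]] = node.input[0]
        else:
            kept.append(node)
    # Rewire consumers, following chains of removed Identities.
    for node in kept:
        for i, name in enumerate(node.input):
            while name in rewrite:
                name = rewrite[name]
            node.input[i] = name
    del graph.node[:]
    graph.node.extend(kept)
    return model
```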

inisis commented 5 months ago


The reason I wrote onnxslim is that I feel a C++-based project is hard for beginners, while onnxslim is pure Python; onnxslim also aims at more generalized optimization techniques rather than platform-targeted ones. I'm also working on running onnxslim over torch dynamo-exported ONNX models (a sketch below). Hope to hear more details from you, thanks!
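
A sketch of that workflow (same onnxslim API assumption as above; the toy module is a placeholder):

```python
import onnx
import onnxslim  # assumes onnxslim.slim mirrors the CLI
import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.relu(x)

# Export with the dynamo-based exporter, then run onnxslim's generic cleanup.
torch.onnx.dynamo_export(Toy().eval(), torch.randn(2, 3)).save("raw.onnx")
onnx.save(onnxslim.slim(onnx.load("raw.onnx")), "slim.onnx")
```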