microsoft / onnxruntime


[TOOLS]:Using transformers.optimizer optimize large model, segmentation fault (core dumped) #17212

Open han65487312 opened 1 year ago

han65487312 commented 1 year ago

Describe the issue

When I use transformers.optimizer to optimize a UNet model that is larger than 2GB, the remove_useless_cast_nodes pass causes a segfault. I found that the symbolic shape inference inside remove_useless_cast_nodes breaks down.

The command is: python3 -m onnxruntime.transformers.optimizer --input ./unet_onnx/original_model/unet.onnx --output ./unet_onnx/fuse_fp16_model/unet.onnx --model_type unet --opt_level 99 --float16 --use_gpu

When I turn off some optimizations, the optimized model cannot run on the TensorRT backend; the error message is "onnx.ModelProto exceeded maximum protobuf size of 2GB: 2357166045". The cuDNN backend runs fine.

Here are the library versions I am using:

To reproduce

The model is too large for me to upload here.

Urgency

No response

Platform

Linux

OS Version

Ubuntu 7.5.0-3ubuntu1~18.04

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.16.0

ONNX Runtime API

Python

Architecture

X64

Execution Provider

Default CPU, CUDA, TensorRT

Execution Provider Library Version

No response

tianleiwu commented 1 year ago

@han65487312,

The segmentation fault (core dumped) might be caused by protobuf. You can downgrade protobuf to 3.20.3 and try again.
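
For example (a one-liner, assuming a pip-managed Python environment):

pip install protobuf==3.20.3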

The optimizer is intended for the CUDA provider; it needs the UNet to be a float32 model, and you should use --opt_level 0.
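
For example, the command above adapted to use --opt_level 0 (a sketch, assuming the input unet.onnx is the exported fp32 model):

python3 -m onnxruntime.transformers.optimizer --input ./unet_onnx/original_model/unet.onnx --output ./unet_onnx/fuse_fp16_model/unet.onnx --model_type unet --opt_level 0 --float16 --use_gpu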

The optimizer is not for TensorRT EP because TensorRT has its own graph optimization logic.

For the TRT EP, you can try https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/onnxruntime_tensorrt_txt2img.py for the SD 1.5 or 2.1 models. Basically, it follows the same logic as https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion to generate the ONNX models for the TensorRT backend.

Example code

import torch
from diffusers.schedulers import DDIMScheduler
from onnxruntime.transformers.models.stable_diffusion.onnxruntime_tensorrt_txt2img import OnnxruntimeTensorRTStableDiffusionPipeline

model_name_or_path = "runwayml/stable-diffusion-v1-5"
scheduler = DDIMScheduler.from_pretrained(model_name_or_path, subfolder="scheduler")

pipe = OnnxruntimeTensorRTStableDiffusionPipeline.from_pretrained(
    model_name_or_path,
    revision="fp16",
    torch_dtype=torch.float16,
    scheduler=scheduler,
    image_height=512,
    image_width=512,
    max_batch_size=4,
)

# re-use cached folder to save ONNX models and TensorRT Engines
pipe.set_cached_folder(model_name_or_path, revision="fp16")

pipe = pipe.to("cuda")

prompt = "photorealistic new zealand hills"
image = pipe(prompt).images[0]
image.save("ort_trt_txt2img_new_zealand_hills.png")

For SDXL, we are still working on the optimization.

For more information, see https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/transformers/models/stable_diffusion/README.md

han65487312 commented 1 year ago

Thanks for your reply. Indeed, my UNet is a customized model; it's not in the diffusers repo. I wonder whether there is a way to make attention run on the cuDNN backend while the other optimizations run on the TRT backend. The steps I follow with transformers.optimizer are: 1. export my customized fp32 UNet model; 2. use transformers.optimizer to fuse the attention layers; 3. run the model with onnxruntime. If I set --opt_level 0, the attention fusion in step 2 does not work.
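
A minimal Python sketch of steps 2 and 3 via the optimizer API (paths and parameters are illustrative, not the reporter's actual ones):

from onnxruntime.transformers.optimizer import optimize_model
import onnxruntime as ort

# Step 2: fuse attention layers in the exported fp32 UNet, then convert to fp16
opt = optimize_model("unet_fp32.onnx", model_type="unet", use_gpu=True)
opt.convert_float_to_float16()
# use_external_data_format is needed once the model exceeds the 2GB protobuf limit
opt.save_model_to_file("unet_fp16_fused.onnx", use_external_data_format=True)

# Step 3: run the fused model with onnxruntime on the CUDA EP
sess = ort.InferenceSession("unet_fp16_fused.onnx", providers=["CUDAExecutionProvider"])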

tianleiwu commented 1 year ago

@han65487312,

Regarding whether attention can run on the CUDA backend while the other optimizations run on the TRT backend: if you use both the TRT and CUDA providers at session creation, ORT will partition the fused nodes to the CUDA EP and the remaining nodes to the TRT EP.
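
A minimal sketch of such a session (the model path is illustrative):

import onnxruntime as ort

# Register both EPs; TRT gets first pick of the graph, and nodes it cannot take
# (e.g. the fused Attention ops) fall back to the CUDA EP.
sess = ort.InferenceSession(
    "unet_fp16_fused.onnx",
    providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
)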

However, that might not be a good way to use TRT, since TRT needs to convert the NCHW layout to NHWC for the whole graph internally. If you use the optimizer intended for the CUDA EP, TRT cannot reach its full potential because it only gets to work on subgraphs.

--opt_level 0 is required for ORT releases before 1.16, since previously ORT could not save an optimized model larger than 2GB. This constraint is removed in ORT 1.16 (built from source).

I think TRT can handle a model larger than 2GB, since TRT can run the SDXL model, which exceeds 2GB. @chilo-ms, is there some limitation in the TRT EP?