huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.

Poor performance when generating images with NeuronStableDiffusionPipeline #576

Closed: yahavb closed this issue 2 months ago

yahavb commented 2 months ago

System Info

I followed https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion to build and deploy an inference endpoint, then compared the optimum-neuron version (https://github.com/yahavb/edge_diffusion_on_eks/blob/master/app/run-sd2.py) against the baseline https://github.com/yahavb/edge_diffusion_on_eks/blob/master/app/run.py.

Inference of a single image with num_inference_steps=1 took 824.9 ms with NeuronStableDiffusionPipeline versus 198.5 ms with StableDiffusionPipeline.
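For reference, a minimal timing sketch (not from the issue) of how such per-image numbers could be measured. It assumes neuron_pipe and baseline_pipe are already-loaded pipeline objects and uses a placeholder prompt:

import time

prompt = "a photo of an astronaut riding a horse"  # placeholder prompt

# neuron_pipe / baseline_pipe are assumed to be pre-loaded pipeline objects
for name, pipe in [("NeuronStableDiffusionPipeline", neuron_pipe),
                   ("StableDiffusionPipeline", baseline_pipe)]:
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=1).images[0]  # single denoising step
    print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")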

Who can help?

@JingyaHuang


Reproduction (minimal, reproducible, runnable)

https://github.com/aws-samples/edge_diffusion_on_eks

Expected behavior

Comparable performance between NeuronStableDiffusionPipeline and StableDiffusionPipeline

JingyaHuang commented 2 months ago

Hi @yahavb, thanks for opening the issue. Let me check if I can reproduce it.

JingyaHuang commented 2 months ago

Hey @yahavb, have you tried setting inline_weights_to_neff=True? It's an argument I recently defaulted to False (since we would like to leverage it for caching), and according to my experiments it seems to slow down inference quite heavily...

yahavb commented 2 months ago

Setting inline_weights_to_neff to True improved the latency. Thanks!

from optimum.neuron import NeuronStableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1"  # example checkpoint; substitute the model you deploy
batch_size, height, width = 1, 512, 512        # static input shapes are fixed at compile time

compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16", "inline_weights_to_neff": True}  # boolean True, not the string "True"
input_shapes = {"batch_size": batch_size, "height": height, "width": width}
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(model_id, export=True, **compiler_args, **input_shapes)
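
As a follow-up, the tutorial linked above also shows that the compiled pipeline can be saved and reloaded, so the export and compilation cost is paid only once per model and shape configuration ("sd_neuron/" below is a placeholder path):

# Save the compiled artifacts once; later runs reload them without re-exporting.
stable_diffusion.save_pretrained("sd_neuron/")

# In a fresh process:
from optimum.neuron import NeuronStableDiffusionPipeline
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained("sd_neuron/")
image = stable_diffusion("a photo of an astronaut riding a horse", num_inference_steps=1).images[0]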