huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.

Poor performance when generating images with NeuronStableDiffusionPipeline #576

Closed: yahavb closed this issue 2 months ago

yahavb commented 2 months ago

System Info

I followed https://huggingface.co/docs/optimum-neuron/tutorials/stable_diffusion to build and deploy an inference endpoint, then compared the optimum-neuron version (https://github.com/yahavb/edge_diffusion_on_eks/blob/master/app/run-sd2.py) against the baseline https://github.com/yahavb/edge_diffusion_on_eks/blob/master/app/run.py.

Inference of a single image with num_inference_steps=1 took 824.9 ms with NeuronStableDiffusionPipeline versus 198.5 ms with StableDiffusionPipeline.
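For reference, a minimal timing sketch (not from the issue) of how such per-image numbers could be measured. It assumes neuron_pipe and baseline_pipe are already-loaded pipeline objects and uses a placeholder prompt:

import time

prompt = "a photo of an astronaut riding a horse"  # placeholder prompt

# neuron_pipe / baseline_pipe are assumed to be pre-loaded pipeline objects
for name, pipe in [("NeuronStableDiffusionPipeline", neuron_pipe),
                   ("StableDiffusionPipeline", baseline_pipe)]:
    start = time.perf_counter()
    image = pipe(prompt, num_inference_steps=1).images[0]  # single denoising step
    print(f"{name}: {(time.perf_counter() - start) * 1000:.1f} ms")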

Who can help?

@JingyaHuang


Reproduction (minimal, reproducible, runnable)

https://github.com/aws-samples/edge_diffusion_on_eks

Expected behavior

Comparable performance between NeuronStableDiffusionPipeline and StableDiffusionPipeline

JingyaHuang commented 2 months ago

Hi @yahavb, thanks for opening the issue. Let me check if I can reproduce it.

JingyaHuang commented 2 months ago

Hey @yahavb, have you tried setting inline_weights_to_neff=True? It's an argument I recently defaulted to False (since we would like to leverage it for caching), and according to my experiments it seems to slow down inference quite heavily...

yahavb commented 2 months ago

Setting inline_weights_to_neff to True improved the latency. Thanks!

from optimum.neuron import NeuronStableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1"  # example checkpoint; substitute the model you deploy
batch_size, height, width = 1, 512, 512        # static input shapes are fixed at compile time

compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16", "inline_weights_to_neff": True}  # boolean True, not the string "True"
input_shapes = {"batch_size": batch_size, "height": height, "width": width}
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(model_id, export=True, **compiler_args, **input_shapes)
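
As a follow-up, the tutorial linked above also shows that the compiled pipeline can be saved and reloaded, so the export and compilation cost is paid only once per model and shape configuration ("sd_neuron/" below is a placeholder path):

# Save the compiled artifacts once; later runs reload them without re-exporting.
stable_diffusion.save_pretrained("sd_neuron/")

# In a fresh process:
from optimum.neuron import NeuronStableDiffusionPipeline
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained("sd_neuron/")
image = stable_diffusion("a photo of an astronaut riding a horse", num_inference_steps=1).images[0]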