Closed — yahavb closed this issue 2 months ago
Hi @yahavb, thanks for opening the issue. Let me check if I can reproduce it.
Hey @yahavb, have you tried setting inline_weights_to_neff=True? It's an argument that I recently switched to False by default (since we would like to leverage it for caching), and in my experiments it slows down inference quite heavily...
Setting inline_weights_to_neff to True improved the inference latency. Thanks!
from optimum.neuron import NeuronStableDiffusionPipeline

model_id = "stabilityai/stable-diffusion-2-1"  # example checkpoint
batch_size, height, width = 1, 512, 512  # example shapes

# Note: inline_weights_to_neff must be a Python boolean, not the string "True"
compiler_args = {"auto_cast": "matmul", "auto_cast_type": "bf16", "inline_weights_to_neff": True}
input_shapes = {"batch_size": batch_size, "height": height, "width": width}
stable_diffusion = NeuronStableDiffusionPipeline.from_pretrained(model_id, export=True, **compiler_args, **input_shapes)
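One subtle point worth noting (plain Python behavior, not specific to optimum-neuron): passing the flag as the string "True" instead of a boolean happens to be truthy, but so is the string "False", so the flag could never be turned off that way. A minimal illustration:

```python
# Any non-empty string is truthy in Python, so passing flags as
# strings instead of booleans can silently enable them.
print(bool("True"))   # True
print(bool("False"))  # also True -- the string is non-empty
print(bool(False))    # False
```

This is why the snippet above uses the boolean True rather than the string "True".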
System Info
Who can help?
@JingyaHuang
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction (minimal, reproducible, runnable)
https://github.com/aws-samples/edge_diffusion_on_eks
Expected behavior
Comparable performance between NeuronStableDiffusionPipeline and StableDiffusionPipeline.