huggingface / optimum-neuron

Easy, fast and very cheap training and inference on AWS Trainium and Inferentia chips.
Apache License 2.0

SD: compile `post_quant_conv`? #333

Closed · neo closed this issue 10 months ago

neo commented 10 months ago

With the samples from the AWS Neuron team, the VAE's post_quant_conv is compiled: https://github.com/aws-neuron/aws-neuron-samples/blob/master/torch-neuronx/inference/hf_pretrained_sdxl_base_1024_inference.ipynb

However, with our lib here it doesn't appear to be compiled; is that something we can do?

JingyaHuang commented 10 months ago

Hi @neo, in optimum-neuron we do compile the post_quant_conv layer.

Here we override the forward of the VAE with its decode function and trace them together. This means that the VAE decoder exported by Optimum contains both the decoder and the post_quant_conv layer:

https://github.com/huggingface/optimum-neuron/blob/26c31a73c2a78dfd0205dae5150a97c1021bfec8/optimum/exporters/neuron/utils.py#L294-L297
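
To illustrate the idea, here is a simplified sketch of that pattern (not the exact code behind the permalink above; the checkpoint, latent shape, and the torch.jit.trace call standing in for the Neuron tracer are example assumptions):

```python
import torch
from diffusers import AutoencoderKL

# Example checkpoint: the SDXL VAE. Any AutoencoderKL works the same way.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae"
).eval()

# Override forward with decode: AutoencoderKL.decode applies post_quant_conv
# before running the decoder blocks, so both end up in the same traced graph.
vae.forward = lambda latent_sample: vae.decode(latent_sample, return_dict=False)

# Example latent input (batch 1, 4 latent channels, 64x64 -> a 512x512 image;
# the 1024x1024 SDXL case uses 128x128 latents).
latents = torch.randn(1, 4, 64, 64)

# torch.jit.trace is a stand-in here; the real export traces with the Neuron
# compiler, producing a single artifact containing decoder + post_quant_conv.
traced_decoder = torch.jit.trace(vae, latents)
```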

neo commented 10 months ago

Hey @JingyaHuang! Thank you for such a quick and concise explanation ❤️

I did a quick search in the diffusers library for post_quant_conv; I guess I just didn't understand the difference between AutoencoderKL and VQModel well enough.

neo commented 10 months ago

I also have another follow-up question from reading the AWS Neuron team's samples, which is also somewhat related to the other discussion we had here: https://huggingface.co/aws-neuron/stable-diffusion-xl-base-1-0-1024x1024/discussions/3

I see that set_dynamic_batching is used only when loading the UNet model with DataParallel, and I wonder: does dynamic_batch_size need to be a compile-time option passed via compiler_args, instead of being set when the model is loaded after compilation?

Thanks!! ❤️

JingyaHuang commented 10 months ago

Hi @neo,

For the diffusers implementation, I pointed you to the VQ-VAE, which might be slightly confusing, sorry for that. Whether it is the VQ-VAE or the regular VAE (AutoencoderKL), in Optimum we trace the decode function so that we avoid compiling two separate artifacts.


As for your question about dynamic batching: if you want to enable it, you will need to set dynamic_batch_size=True when compiling your model with the API:

https://github.com/huggingface/optimum-neuron/blob/d00d7090dca4ee82198cd0d9448b5579e104c41c/optimum/neuron/modeling_diffusion.py#L500
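
For example, a minimal sketch with the SDXL pipeline API (the model id, input shapes, and output directory are just example values):

```python
from optimum.neuron import NeuronStableDiffusionXLPipeline

# Compile with a static batch size of 1 and dynamic batching enabled; with
# dynamic batching on, inputs whose batch size is a multiple of the compiled
# batch size are accepted at inference time.
pipe = NeuronStableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    export=True,
    dynamic_batch_size=True,
    batch_size=1,
    num_images_per_prompt=1,
    height=1024,
    width=1024,
)
pipe.save_pretrained("sdxl_neuron/")
```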

Or by passing --dynamic-batch-size to the CLI:

https://github.com/huggingface/optimum-neuron/blob/d00d7090dca4ee82198cd0d9448b5579e104c41c/optimum/commands/export/neuronx.py#L87-L91

set_dynamic_batching is more likely an option to turn dynamic batching off (if the model was compiled with the option on) rather than the other way around.

More details here
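
And just to illustrate what this buys you at inference time, a hedged sketch (it assumes the artifacts above were saved with dynamic_batch_size=True; the path and prompts are made up):

```python
from optimum.neuron import NeuronStableDiffusionXLPipeline

# Reload the pre-compiled artifacts (example path from the export sketch above).
pipe = NeuronStableDiffusionXLPipeline.from_pretrained("sdxl_neuron/")

# Because dynamic batching was enabled at compile time, a batch of two prompts
# can run on a model compiled with batch_size=1: the input is split along the
# batch dimension into chunks of the compiled batch size.
images = pipe(
    prompt=["a photo of an astronaut riding a horse", "a watercolor landscape"]
).images
```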

neo commented 10 months ago

Okay, so dynamic batching is indeed a compilation flag. Thank you again for clarifying!!

neo commented 10 months ago

Sorry, hopefully this is my last follow-up question 😂 Then why does --num_images_per_prompt need to be a compilation option? Or is it just a convenience to set a default in the output config that can be overridden during inference?