Welcome to x-stable-diffusion by Stochastic!
This project is a compilation of acceleration techniques for the Stable Diffusion model to help you generate images faster and more efficiently, saving you both time and money.
With example images and a comprehensive benchmark, you can easily choose the best technique for your needs. When you're ready to deploy, our CLI, stochasticx, makes it easy to get started on your local machine. Try x-stable-diffusion and see the difference it can make for your image generation performance and cost savings.
Make sure you have Python and Docker installed on your system
Install the latest version of the stochasticx library.
pip install stochasticx
Deploy the Stable Diffusion model
stochasticx stable-diffusion deploy --type aitemplate
Alternatively, you can deploy Stable Diffusion without our CLI by following the steps here.
To perform inference with this deployed model:
stochasticx stable-diffusion inference --prompt "Riding a horse"
Check all the options of the inference command:
stochasticx stable-diffusion inference --help
You can get the logs of the deployment by executing the following command:
stochasticx stable-diffusion logs
Stop and remove the deployment with this command:
stochasticx stable-diffusion stop
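The deploy, inference, logs, and stop commands above can also be driven from a script. The following is a minimal sketch, not part of the stochasticx library: the helper names `build_command` and `run` are hypothetical, and only the subcommands and flags shown above (`--type`, `--prompt`) are used.

```python
import subprocess

# Common prefix for every command shown in this README.
BASE = ["stochasticx", "stable-diffusion"]


def build_command(action, **options):
    """Build the argv list for one stochasticx stable-diffusion subcommand.

    `action` is one of the subcommands from this README (deploy, inference,
    logs, stop). Each keyword option becomes `--name value`.
    """
    cmd = BASE + [action]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


def run(action, **options):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(action, **options), check=True)


# The full lifecycle from this README, as argv lists (nothing is executed here).
LIFECYCLE = [
    build_command("deploy", type="aitemplate"),
    build_command("inference", prompt="Riding a horse"),
    build_command("logs"),
    build_command("stop"),
]
```

To execute a step, call for example `run("inference", prompt="Riding a horse")`. This requires stochasticx and Docker to be installed, as described above.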
Change the num_inference_steps to 30. With this, you can get an image generated in 0.88 seconds.
{
'max_seq_length': 64,
'num_inference_steps': 30,
'image_size': (512, 512)
}
You can also experiment with reducing the image_size.
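As a rough sanity check on the 0.88-second figure, the sketch below scales the 50-step AITemplate latency from the benchmark table (1.38 s on an A100) linearly with the number of steps. It assumes that both numbers come from the same hardware and that latency is roughly proportional to the step count. Neither assumption is stated in this README.

```python
# Figures taken from this README.
latency_50_steps = 1.38   # seconds, AITemplate fp16, batch size 1, 50 steps
reported_30_steps = 0.88  # seconds, quoted for num_inference_steps = 30

# Naive estimate: assume latency scales linearly with the number of steps.
estimated_30_steps = latency_50_steps * 30 / 50

# Whatever is left over is not explained by per-step cost alone. It may be
# fixed overhead outside the denoising loop (an assumption, not measured here).
unexplained = reported_30_steps - estimated_30_steps

print(f"linear estimate: {estimated_30_steps:.3f} s")
print(f"reported:        {reported_30_steps:.3f} s")
print(f"difference:      {unexplained:.3f} s")
```

The estimate and the reported value are close, which suggests that reducing num_inference_steps cuts latency almost proportionally.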
In each folder, we will provide a Google Colab notebook with which you can test the full flow and inference on a T4 GPU
Check the README.md of the following directories:
For hardware, we used one 40 GB A100 GPU with CUDA 11.6. The reported results are averages over 50 runs.
The following arguments were used for image generation for all the benchmarks:
{
'max_seq_length': 64,
'num_inference_steps': 50,
'image_size': (512, 512)
}
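The averaging over 50 runs could be reproduced with a small timing harness like the one below. This is a sketch, not the benchmark script used for these results. The `benchmark` helper and its warm-up phase are assumptions, and `generate` stands in for whichever pipeline call you want to time.

```python
import statistics
import time


def benchmark(generate, runs=50, warmup=3):
    """Return the mean wall-clock latency of `generate` in seconds.

    `warmup` calls are made first and discarded, so one-time costs such as
    model loading or kernel compilation do not skew the average.
    """
    for _ in range(warmup):
        generate()

    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        # For GPU work, make sure `generate` blocks until the result is ready
        # (for example by synchronizing the device), or the timing will be
        # too optimistic.
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


# Placeholder workload so the sketch runs without a GPU.
mean_latency = benchmark(lambda: sum(range(10_000)), runs=50)
print(f"mean latency over 50 runs: {mean_latency:.6f} s")
```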
For batch_size 1, these are the latency results:
project | Latency (s) | GPU VRAM (GB) |
---|---|---|
PyTorch fp16 | 5.77 | 10.3 |
nvFuser fp16 | 3.15 | --- |
FlashAttention fp16 | 2.80 | 7.5 |
TensorRT fp16 | 1.68 | 8.1 |
AITemplate fp16 | 1.38 | 4.83 |
ONNX (CUDA) | 7.26 | 13.3 |
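To compare the techniques at a glance, the latencies in the table above can be converted into speedups relative to the PyTorch fp16 baseline. The snippet below is only arithmetic on the numbers reported in the table.

```python
# Latencies in seconds, copied from the table above (batch size 1, A100).
latency_s = {
    "PyTorch fp16": 5.77,
    "nvFuser fp16": 3.15,
    "FlashAttention fp16": 2.80,
    "TensorRT fp16": 1.68,
    "AITemplate fp16": 1.38,
    "ONNX (CUDA)": 7.26,
}

baseline = latency_s["PyTorch fp16"]

# Speedup > 1 means faster than the PyTorch fp16 baseline.
speedup = {name: baseline / seconds for name, seconds in latency_s.items()}

# Print from fastest to slowest.
for name, factor in sorted(speedup.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {factor:.2f}x")
```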
Note: AITemplate might not support T4 GPU yet. Check support here
project | Latency (s) |
---|---|
PyTorch fp16 | 16.2 |
nvFuser fp16 | 19.3 |
FlashAttention fp16 | 13.7 |
TensorRT fp16 | 9.3 |
The following results were obtained by varying batch_size from 1 to 24.
project \ bs | 1 | 4 | 8 | 16 | 24 |
---|---|---|---|---|---|
PyTorch fp16 | 5.77s/10.3GB | 19.2s/18.5GB | 36s/26.7GB | OOM | |
FlashAttention fp16 | 2.80s/7.5GB | 9.1s/17GB | 17.7s/29.5GB | OOM | |
TensorRT fp16 | 1.68s/8.1GB | OOM | | | |
AITemplate fp16 | 1.38s/4.83GB | 4.25s/8.5GB | 7.4s/14.5GB | 15.7s/25GB | 23.4s/36GB |
ONNX (CUDA) | 7.26s/13.3GB | OOM | OOM | OOM | OOM |
Note: TensorRT fails to convert UNet model from ONNX to TensorRT due to memory issues.
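Latency alone does not show whether larger batches pay off. Dividing the batch size by the latency gives throughput in images per second. The sketch below does this for the AITemplate and PyTorch rows of the table above, using only the reported figures.

```python
# Batch size -> latency in seconds, copied from the table above.
latency_by_batch = {
    "AITemplate fp16": {1: 1.38, 4: 4.25, 8: 7.4, 16: 15.7, 24: 23.4},
    "PyTorch fp16": {1: 5.77, 4: 19.2, 8: 36.0},  # batch 16 ran out of memory
}


def throughput(latencies):
    """Convert {batch_size: latency_s} into {batch_size: images_per_second}."""
    return {bs: bs / seconds for bs, seconds in latencies.items()}


for name, latencies in latency_by_batch.items():
    per_batch = throughput(latencies)
    best = max(per_batch, key=per_batch.get)
    print(name)
    for bs, images_per_s in per_batch.items():
        print(f"  batch {bs:>2}: {images_per_s:.2f} images/s")
    print(f"  highest throughput at batch size {best}")
```

On these numbers, throughput improves only modestly as the batch grows, while VRAM use rises steeply, so a large batch is mainly useful when memory is plentiful.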
Click here to view the complete list of generated images
Optimization \ Prompt | Super Mario learning to fly in an airport, Painting by Leonardo Da Vinci | The Easter bunny riding a motorcycle in New York City | Drone flythrough of a tropical jungle covered in snow |
---|---|---|---|
PyTorch fp16 | | | |
nvFuser fp16 | | | |
FlashAttention fp16 | | | |
TensorRT fp16 | | | |
AITemplate fp16 | | | |
As an open source project in a rapidly evolving field, we welcome contributions of all kinds, including new features and better documentation. Please read our contributing guide to learn how you can get involved.
For managed hosting on our cloud or on your private cloud, contact us.