Describe the bug
I am running Stable Diffusion 2.1 with DeepSpeed inference on an NVIDIA A10 GPU. The pipeline runs end to end and latency does drop, but the generated images look quite weird. The same setup works perfectly with Stable Diffusion 1.4. Can someone help me resolve this? The code, its output, and my environment details are listed below.
Thanks.
To Reproduce
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
import deepspeed
from time import perf_counter
import numpy as np

HF_MODEL_ID = "stabilityai/stable-diffusion-2-1"
HF_TOKEN = ""  # your hf token: https://huggingface.co/settings/tokens

# pipe = DiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16, use_auth_token=HF_TOKEN).to("cuda")
pipe = StableDiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Used to conduct timings
def measure_latency(pipe, prompt):
    latencies = []
    # warm up
    pipe.set_progress_bar_config(disable=True)
    for _ in range(2):
        _ = pipe(prompt, height=512, width=512)
    # Timed run (each iteration also saves the generated image to disk)
    for i in range(10):
        start_time = perf_counter()
        a = pipe(prompt, height=512, width=512)
        a.images[0].save("aimg" + str(i) + ".png")
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_s = np.mean(latencies)
    time_std_s = np.std(latencies)
    time_p95_s = np.percentile(latencies, 95)
    return f"P95 latency (seconds) - {time_p95_s:.2f}; Average latency (seconds) - {time_avg_s:.2f} +\\- {time_std_s:.2f};", time_p95_s

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    deepspeed.init_inference(
        model=getattr(pipe, "model", pipe),  # Transformers models
        # mp_size=1,  # Number of GPU
        dtype=torch.float16,  # dtype of the weights (fp16)
        # replace_method="auto",  # Lets DS automatically identify the layer to replace
        replace_with_kernel_inject=True,  # replace the model with the kernel injector
    )
    # a = pipe(prompt, height=512, width=512)
    # print(a)
    ds_results = measure_latency(pipe, prompt)
    print(f"DeepSpeed model: {ds_results[0]}")
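To put a number on how far the DeepSpeed images drift from the plain diffusers output, something like the sketch below could be used. It is not part of the script above: `image_difference` and `looks_degenerate` are hypothetical helper names, and the `min_std` threshold is an arbitrary assumption. The helpers only need NumPy, so they can be checked without a GPU.

```python
import numpy as np


def image_difference(img_a, img_b):
    """Return (mean, max) absolute per-pixel difference between two images.

    Accepts anything np.asarray can convert (PIL images, arrays), as long as
    both inputs have the same shape.
    """
    a = np.asarray(img_a, dtype=np.float32)
    b = np.asarray(img_b, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    diff = np.abs(a - b)
    return float(diff.mean()), float(diff.max())


def looks_degenerate(img, min_std=1.0):
    """Heuristic: flag images that are close to one flat colour (e.g. all black).

    min_std is an assumed threshold, not a calibrated value.
    """
    return float(np.asarray(img, dtype=np.float32).std()) < min_std


# Hypothetical usage with the pipeline from the script above (needs a GPU,
# so it is left commented out here):
#
# gen = torch.Generator("cuda").manual_seed(0)
# baseline = pipe(prompt, height=512, width=512, generator=gen).images[0]
# deepspeed.init_inference(...)  # same arguments as in the script above
# gen = torch.Generator("cuda").manual_seed(0)
# ds_image = pipe(prompt, height=512, width=512, generator=gen).images[0]
# print(image_difference(baseline, ds_image), looks_degenerate(ds_image))
```

Seeding the generator identically before each call keeps the initial latents the same, so any large difference should come from the replaced modules rather than from sampling noise.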
Output from above code:
/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
from pandas.core.computation.check import NUMEXPR_INSTALLED
Fetching 13 files: 100%
13/13 [00:00<00:00, 830.64it/s]
[2023-04-21 00:45:34,490] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1+3a2dc40d, git-hash=3a2dc40d, git-branch=HEAD
[2023-04-21 00:45:34,492] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
**** found and replaced vae w. <class 'deepspeed.model_implementations.diffusers.vae.DSVAE'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Time to load transformer_inference op: 0.3998148441314697 seconds
[2023-04-21 00:45:35,814] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Attention config: {'layer_id': 0, 'hidden_size': 320, 'intermediate_size': 1280, 'heads': 5, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': False, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 4096, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Loading extension module transformer_inference...
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07170987129211426 seconds
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/spatial_inference/build.ninja...
Building extension module spatial_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.4000411033630371 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module spatial_inference, skipping build step...
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.07284045219421387 seconds
DeepSpeed model: P95 latency (seconds) - 2.60; Average latency (seconds) - 2.59 +\- 0.01;
ds_report output
2023-04-21 00:48:19.878939: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:
Local host: 192-9-246-221
Device name: mlx5_0
Device vendor ID: 0x02c9
Device vendor part ID: 4126
Default device parameters will be used, which may result in lower
performance. You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.
NOTE: You can turn off this warning by setting the MCA parameter
btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port. As such, the openib BTL (OpenFabrics
support) will be disabled for this port.
Local host: 192-9-246-221
Local device: mlx5_0
Local port: 1
CPCs attempted: udcm
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1+3a2dc40d, 3a2dc40d, HEAD
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6
Screenshots
System info (please complete the following information):
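Besides the ds_report output above, the installed package versions can be collected with a short script. This is only a sketch: `collect_versions` is a hypothetical helper, and the package list is an assumption based on the imports in the reproduction script (transformers is included on the assumption that diffusers uses it for the text encoder).

```python
import platform
from importlib import metadata


def collect_versions(packages):
    """Map each package name to its installed version, or 'not installed'."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Assumed package list; adjust to whatever the environment actually uses.
    packages = ["torch", "deepspeed", "diffusers", "transformers", "numpy"]
    for name, version in collect_versions(packages).items():
        print(f"{name}: {version}")
```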