Images generated from stable diffusion 2.1 are weird

Describe the bug

I ran Stable Diffusion 2.1 with DeepSpeed on an NVIDIA A10 GPU. The pipeline runs end to end and latency does drop, but the generated images are quite weird. The same setup with Stable Diffusion 1.4 works perfectly. Can someone help me resolve this? The code and dependencies I used are listed below. Thanks.

To Reproduce

!pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url https://download.pytorch.org/whl/cu116
!pip install git+https://github.com/microsoft/deepspeed@3a2dc40d54489b176981cf24c7c1f296c8fc5d30 
!pip install --upgrade diffusers==0.11.0 transformers==4.24.0 safetensors scipy triton==2.0.0.dev20221031 accelerate
!pip install ftfy

from time import perf_counter

import numpy as np
import torch
import deepspeed
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
HF_MODEL_ID="stabilityai/stable-diffusion-2-1"
HF_TOKEN="" # your hf token: https://huggingface.co/settings/tokens

#pipe = DiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16 ,use_auth_token=HF_TOKEN).to("cuda")

# Load the fp16 weights, swap in the DPM-Solver multistep scheduler, then move to the GPU
pipe = StableDiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Used to conduct timings
def measure_latency(pipe, prompt):
    latencies = []
    # Warm up
    pipe.set_progress_bar_config(disable=True)
    for _ in range(2):
        _ = pipe(prompt, height=512, width=512)
    # Timed run (note: saving the image is included in the measured latency)
    for i in range(10):
        start_time = perf_counter()
        a = pipe(prompt, height=512, width=512)
        a.images[0].save(f"aimg{i}.png")
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_s = np.mean(latencies)
    time_std_s = np.std(latencies)
    time_p95_s = np.percentile(latencies, 95)
    return f"P95 latency (seconds) - {time_p95_s:.2f}; Average latency (seconds) - {time_avg_s:.2f} +\- {time_std_s:.2f};", time_p95_s

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    deepspeed.init_inference(
        model=getattr(pipe, "model", pipe),  # Transformers models
        # mp_size=1,  # Number of GPUs
        dtype=torch.float16,  # dtype of the weights (fp16)
        # replace_method="auto",  # Lets DS automatically identify the layer to replace
        replace_with_kernel_inject=True,  # replace the model with the kernel injector
    )
    # a = pipe(prompt,  height=512, width=512)
    # print(a)
    ds_results = measure_latency(pipe,prompt)
    print(f"DeepSpeed model: {ds_results[0]}")

Output from above code:

/home/ubuntu/.local/lib/python3.8/site-packages/pandas/core/computation/expressions.py:20: UserWarning: Pandas requires version '2.7.3' or newer of 'numexpr' (version '2.7.1' currently installed).
  from pandas.core.computation.check import NUMEXPR_INSTALLED
Fetching 13 files: 100%
13/13 [00:00<00:00, 830.64it/s]
[2023-04-21 00:45:34,490] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed info: version=0.9.1+3a2dc40d, git-hash=3a2dc40d, git-branch=HEAD
[2023-04-21 00:45:34,492] [INFO] [logging.py:96:log_dist] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
**** found and replaced vae w. <class 'deepspeed.model_implementations.diffusers.vae.DSVAE'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Time to load transformer_inference op: 0.3998148441314697 seconds
[2023-04-21 00:45:35,814] [INFO] [logging.py:96:log_dist] [Rank -1] DeepSpeed-Attention config: {'layer_id': 0, 'hidden_size': 320, 'intermediate_size': 1280, 'heads': 5, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': False, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 4096, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Loading extension module transformer_inference...
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07170987129211426 seconds
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/spatial_inference/build.ninja...
Building extension module spatial_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.4000411033630371 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module spatial_inference, skipping build step...
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.07284045219421387 seconds
DeepSpeed model: P95 latency (seconds) - 2.60; Average latency (seconds) - 2.59 +\- 0.01;

ds_report output

2023-04-21 00:48:19.878939: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            192-9-246-221
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4126

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
No OpenFabrics connection schemes reported that they were able to be
used on a specific port.  As such, the openib BTL (OpenFabrics
support) will be disabled for this port.

  Local host:           192-9-246-221
  Local device:         mlx5_0
  Local port:           1
  CPCs attempted:       udcm
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/home/ubuntu/.local/lib/python3.8/site-packages/torch']
torch version .................... 1.13.1+cu116
deepspeed install path ........... ['/home/ubuntu/.local/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.9.1+3a2dc40d, 3a2dc40d, HEAD
torch cuda version ............... 11.6
torch hip version ................ None
nvcc version ..................... 11.6
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.6

Screenshots

[image: aimg2]
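To put a number on how far a generated image is from a reference, a small helper along these lines could be used. This is only a minimal sketch and was not part of the run above: `image_diff_stats` and the file names in the usage comment are hypothetical, and it assumes a baseline image produced without DeepSpeed from the same prompt and a fixed seed (the script above does not set one).

```python
import numpy as np


def image_diff_stats(baseline, candidate):
    """Return simple per-pixel difference statistics for two same-sized images.

    Both arguments can be anything np.asarray accepts, e.g. PIL images
    or uint8 arrays of shape (height, width, channels).
    """
    a = np.asarray(baseline, dtype=np.float32)
    b = np.asarray(candidate, dtype=np.float32)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    diff = np.abs(a - b)
    return {
        "mean_abs_diff": float(diff.mean()),      # average error per channel value
        "max_abs_diff": float(diff.max()),        # worst single channel value
        "frac_changed": float((diff > 0).mean())  # share of values that differ at all
    }


# Usage (hypothetical file names; assumes the same prompt and seed for both runs):
# from PIL import Image
# stats = image_diff_stats(Image.open("baseline.png"), Image.open("aimg2.png"))
# print(stats)
```

Small differences would be expected from fp16 kernels alone, so this would only help separate minor numerical drift from images that are broken outright.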

System info (please complete the following information):

OS: [e.g. Ubuntu 18.04]
GPU : Nvidia A10

microsoft / DeepSpeed

Images generated from stable diffusion 2.1 are weird #3328