Images generated from stable diffusion 2.1 are weird #3328

Open DhruvThu opened 1 year ago

DhruvThu commented 1 year ago

Describe the bug

I ran Stable Diffusion 2.1 with DeepSpeed on an NVIDIA A10 GPU. The pipeline runs and latency drops, but the generated images are quite weird. The same setup works perfectly with Stable Diffusion 1.4. Can someone help me resolve this issue? The code and dependencies I used are listed below. Thanks.

To Reproduce

!pip install torch==1.13.1+cu116 torchvision==0.14.1+cu116 torchaudio==0.13.1 --extra-index-url
!pip install git+ 
!pip install --upgrade diffusers==0.11.0 transformers==4.24.0 safetensors scipy triton==2.0.0.dev20221031 accelerate --upgrade
!pip install ftfy
import re
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
import torch
from torch import inference_mode
import deepspeed
from time import perf_counter
import numpy as np

HF_TOKEN="" # your hf token:

#pipe = DiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16 ,use_auth_token=HF_TOKEN).to("cuda")

pipe = StableDiffusionPipeline.from_pretrained(HF_MODEL_ID, torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

#Used to conduct timings
def measure_latency(pipe, prompt):
    latencies = []
    # warm up
    for _ in range(2):
        _ =  pipe(prompt, height=512, width=512)
    # Timed run
    for i in range(10):
        start_time = perf_counter()
        a = pipe(prompt,  height=512, width=512)
        latency = perf_counter() - start_time
        latencies.append(latency)
    # Compute run statistics
    time_avg_s = np.mean(latencies)
    time_std_s = np.std(latencies)
    time_p95_s = np.percentile(latencies,95)
    return f"P95 latency (seconds) - {time_p95_s:.2f}; Average latency (seconds) - {time_avg_s:.2f} +\- {time_std_s:.2f};", time_p95_s

prompt = "a photo of an astronaut riding a horse on mars"

with torch.inference_mode():
    deepspeed.init_inference(
        model=getattr(pipe, "model", pipe),  # Transformers models
        # mp_size=1,  # Number of GPUs
        dtype=torch.float16,  # dtype of the weights (fp16)
        # replace_method="auto",  # Lets DS automatically identify the layer to replace
        replace_with_kernel_inject=True,  # replace the model with the kernel injector
    )
    # a = pipe(prompt,  height=512, width=512)
    # print(a)
    ds_results = measure_latency(pipe,prompt)
    print(f"DeepSpeed model: {ds_results[0]}")

Output from above code:

Fetching 13 files: 100%
13/13 [00:00<00:00, 830.64it/s]
[2023-04-21 00:45:34,490] [INFO] [] [Rank -1] DeepSpeed info: version=0.9.1+3a2dc40d, git-hash=3a2dc40d, git-branch=HEAD
[2023-04-21 00:45:34,492] [INFO] [] [Rank -1] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
**** found and replaced vae w. <class 'deepspeed.model_implementations.diffusers.vae.DSVAE'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/transformer_inference/
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Time to load transformer_inference op: 0.3998148441314697 seconds
[2023-04-21 00:45:35,814] [INFO] [] [Rank -1] DeepSpeed-Attention config: {'layer_id': 0, 'hidden_size': 320, 'intermediate_size': 1280, 'heads': 5, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-12, 'mp_size': 1, 'q_int8': False, 'scale_attention': True, 'triangular_masking': False, 'local_attention': False, 'window_size': 256, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': <ActivationFuncType.GELU: 1>, 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': False, 'max_out_tokens': 4096, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Loading extension module transformer_inference...
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module transformer_inference, skipping build step...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.07170987129211426 seconds
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /home/ubuntu/.cache/torch_extensions/py38_cu116/spatial_inference/
Building extension module spatial_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.4000411033630371 seconds
**** found and replaced unet w. <class 'deepspeed.model_implementations.diffusers.unet.DSUNet'>
Using /home/ubuntu/.cache/torch_extensions/py38_cu116 as PyTorch extensions root...
No modifications detected for re-loaded extension module spatial_inference, skipping build step...
Loading extension module spatial_inference...
Time to load spatial_inference op: 0.07284045219421387 seconds
DeepSpeed model: P95 latency (seconds) - 2.60; Average latency (seconds) - 2.59 +\- 0.01;
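
The latency line above comes from `measure_latency`. For anyone who wants to check the timing logic without a GPU, here is a minimal sketch of the same statistics using only the standard library. The names `summarize_latencies` and `time_calls` are hypothetical helpers, not part of DeepSpeed or diffusers; the percentile uses linear interpolation and the standard deviation is the population form, which is my understanding of the NumPy defaults used in the script.

```python
import math
import statistics
from time import perf_counter


def summarize_latencies(latencies):
    """Return (mean, population std, p95) for a non-empty list of seconds."""
    ordered = sorted(latencies)
    # Linear-interpolation percentile: fractional rank into the sorted samples.
    rank = 0.95 * (len(ordered) - 1)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    p95 = ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)
    return statistics.fmean(ordered), statistics.pstdev(ordered), p95


def time_calls(fn, warmup=2, runs=10):
    """Call fn() `warmup` times untimed, then `runs` times timed."""
    for _ in range(warmup):
        fn()
    latencies = []
    for _ in range(runs):
        start = perf_counter()
        fn()
        # Every timed run must be recorded, otherwise the statistics are empty.
        latencies.append(perf_counter() - start)
    return latencies
```

With a real pipeline, `fn` would be something like `lambda: pipe(prompt, height=512, width=512)`.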

ds_report output

Screenshots

[Image: aimg2]
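
To put a number on "weird" rather than judging by eye, one option is to generate the same prompt with the same seed before and after DeepSpeed injection and compare the pixel values. Below is a minimal sketch of such a comparison; `mean_abs_diff` is a hypothetical helper, and it assumes both images have the same size and mode and are passed as flat sequences of 0-255 channel values.

```python
def mean_abs_diff(pixels_a, pixels_b):
    """Mean absolute difference between two equal-length sequences of 0-255 values."""
    if len(pixels_a) != len(pixels_b):
        raise ValueError("images must have the same number of channel values")
    if len(pixels_a) == 0:
        raise ValueError("images must not be empty")
    # Iterating over bytes yields ints, so bytes, lists, and tuples all work here.
    total = sum(abs(a - b) for a, b in zip(pixels_a, pixels_b))
    return total / len(pixels_a)
```

For Pillow images such as those in the pipeline's `.images` output, `image.tobytes()` should give a suitable sequence. A value near 0 means the two runs match closely; a large value for 2.1 but not for 1.4 would point at the injected kernels rather than the prompt or scheduler.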

System info (please complete the following information):

sungeuns commented 1 year ago

I have a similar issue when I use Stable Diffusion 2.1 with DeepSpeed. Any ideas?

saqibameen commented 7 months ago

Has anyone tried integrating it with SD v2.1 locally instead of using it with HF?