Inference speed is halved on version 0.26.0 and later compared with 0.25.1

junnikokuki commented 1 month ago

Describe the bug

Hi, I tested ControlNet img2img inference on the same machine with the same input (image and prompt); the only difference is the diffusers version, 0.25.1 vs 0.29.2. According to the logs, 0.25.1 averages 16 it/s and 0.29.2 averages 8 it/s. I also tested inpaint and txt2img, and they show a performance loss as well.

Reproduction

Pipeline: StableDiffusionControlNetImg2ImgPipeline
ControlNet: lllyasviel_control_v11p_sd15_openpose
Scheduler: UniPCMultistepScheduler

Using tomesd: pipe = tomesd.apply_patch(pipe, ratio=0.5)

Model: https://civitai.com/models/4384?modelVersionId=303662

Test the speed with 0.29.2, then run pip3 install diffusers==0.25.1 and test again.

Logs

0.25.1:
  0%|          | 0/20 [00:00<?, ?it/s]:25,400: WARNING/MainProcess] 
  5%|5         | 1/20 [00:00<00:02,  7.84it/s] WARNING/MainProcess] 
 15%|#5        | 3/20 [00:00<00:01, 13.19it/s] WARNING/MainProcess] 
 25%|##5       | 5/20 [00:00<00:00, 15.42it/s] WARNING/MainProcess] 
 35%|###5      | 7/20 [00:00<00:00, 16.54it/s] WARNING/MainProcess] 
 45%|####5     | 9/20 [00:00<00:00, 17.15it/s] WARNING/MainProcess] 
 55%|#####5    | 11/20 [00:00<00:00, 17.54it/s]WARNING/MainProcess] 
 65%|######5   | 13/20 [00:00<00:00, 17.78it/s]WARNING/MainProcess] 
 75%|#######5  | 15/20 [00:00<00:00, 17.94it/s]WARNING/MainProcess] 
 85%|########5 | 17/20 [00:01<00:00, 18.05it/s]WARNING/MainProcess] 
 95%|#########5| 19/20 [00:01<00:00, 18.13it/s]WARNING/MainProcess] 
100%|##########| 20/20 [00:01<00:00, 17.07it/s]WARNING/MainProcess] 
               | [2024-07-17 06:27:26,647: WARNING/MainProcess] 1.322s

0.29.2:
  0%|          | 0/20 [00:00<?, ?it/s]:00,700: WARNING/MainProcess] 
  5%|5         | 1/20 [00:00<00:03,  5.31it/s] WARNING/MainProcess] 
 10%|#         | 2/20 [00:00<00:03,  5.97it/s] WARNING/MainProcess] 
 15%|#5        | 3/20 [00:00<00:02,  6.12it/s] WARNING/MainProcess] 
 20%|##        | 4/20 [00:00<00:02,  6.28it/s] WARNING/MainProcess] 
 25%|##5       | 5/20 [00:00<00:02,  6.37it/s] WARNING/MainProcess] 
 30%|###       | 6/20 [00:00<00:02,  6.43it/s] WARNING/MainProcess] 
 35%|###5      | 7/20 [00:01<00:02,  6.46it/s] WARNING/MainProcess] 
 40%|####      | 8/20 [00:01<00:01,  6.49it/s] WARNING/MainProcess] 
 45%|####5     | 9/20 [00:01<00:01,  6.50it/s] WARNING/MainProcess] 
 50%|#####     | 10/20 [00:01<00:01,  6.51it/s]WARNING/MainProcess] 
 55%|#####5    | 11/20 [00:01<00:01,  6.52it/s]WARNING/MainProcess] 
 60%|######    | 12/20 [00:01<00:01,  6.53it/s]WARNING/MainProcess] 
 65%|######5   | 13/20 [00:02<00:01,  6.53it/s]WARNING/MainProcess] 
 70%|#######   | 14/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
 75%|#######5  | 15/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
 80%|########  | 16/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
 85%|########5 | 17/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
 90%|######### | 18/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
 95%|#########5| 19/20 [00:02<00:00,  6.53it/s]WARNING/MainProcess] 
100%|##########| 20/20 [00:03<00:00,  6.53it/s]WARNING/MainProcess] 
100%|##########| 20/20 [00:03<00:00,  6.45it/s]WARNING/MainProcess] 
               | [2024-07-17 06:32:03,920: WARNING/MainProcess] 3.31s
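To put a number on the two logs above, the final tqdm rate of each run can be extracted and compared. The following is a minimal sketch, not part of the original report: the helper name `final_rate` and the regular expression are assumptions for illustration, and the sample strings are the last progress lines copied from the logs.

```python
import re

# Matches e.g. "20/20 [00:01<00:00, 17.07it/s]" and captures step, total, rate.
# Lines whose rate is still "?it/s" do not match and are skipped.
RATE_RE = re.compile(r"(\d+)/(\d+) \[[^\]]*?,\s*([\d.]+)it/s\]")


def final_rate(log: str) -> float:
    """Return the it/s value of the last tqdm line where step == total."""
    rate = None
    for match in RATE_RE.finditer(log):
        step, total, its = int(match[1]), int(match[2]), float(match[3])
        if step == total:
            rate = its  # keep overwriting so the last completed line wins
    if rate is None:
        raise ValueError("no completed tqdm line found")
    return rate


# Final progress lines copied from the two logs in this report.
LOG_0_25_1 = "100%|##########| 20/20 [00:01<00:00, 17.07it/s]WARNING/MainProcess]"
LOG_0_29_2 = (
    "100%|##########| 20/20 [00:03<00:00,  6.53it/s]WARNING/MainProcess]\n"
    "100%|##########| 20/20 [00:03<00:00,  6.45it/s]WARNING/MainProcess]"
)

slowdown = final_rate(LOG_0_25_1) / final_rate(LOG_0_29_2)
print(f"0.25.1 is {slowdown:.2f}x faster than 0.29.2 in these logs")
```

The tqdm rate only covers the denoising loop, so it should be read alongside the wall-clock totals printed at the end of each log (1.322s vs 3.31s).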

System Info

4090 ubuntu 22.04, cuda 11.8.0, Python 3.10.6

diffusers 0.29.2 vs 0.25.1

accelerate 0.32.1 albumentations 1.4.2 amqp 5.2.0 antlr4-python3-runtime 4.9.3 asttokens 2.4.1 async-timeout 4.0.3 billiard 4.2.0 celery 5.3.6 certifi 2024.2.2 charset-normalizer 3.3.2 click 8.1.7 click-didyoumean 0.3.0 click-plugins 1.1.1 click-repl 0.3.0 cmake 3.28.4 coloredlogs 15.0.1 comm 0.2.2 contourpy 1.2.0 cycler 0.12.1 Cython 3.0.9 decorator 5.1.1 distro 1.9.0 easydict 1.13 exceptiongroup 1.2.0 executing 2.0.1 filelock 3.13.1 flatbuffers 24.3.7 fonttools 4.50.0 fsspec 2024.3.1 huggingface-hub 0.23.5 humanfriendly 10.0 idna 3.6 imageio 2.34.0 importlib_metadata 7.1.0 insightface 0.7.3 ipython 8.22.2 ipywidgets 8.1.2 jedi 0.19.1 Jinja2 3.1.3 joblib 1.3.2 jupyterlab_widgets 3.0.10 kiwisolver 1.4.5 kombu 5.3.5 lazy_loader 0.3 lpips 0.1.4 MarkupSafe 2.1.5 matplotlib 3.8.3 matplotlib-inline 0.1.6 mpmath 1.3.0 networkx 3.2.1 numpy 1.26.4 nvidia-cublas-cu12 12.1.3.1 nvidia-cuda-cupti-cu12 12.1.105 nvidia-cuda-nvrtc-cu12 12.1.105 nvidia-cuda-runtime-cu12 12.1.105 nvidia-cudnn-cu12 8.9.2.26 nvidia-cufft-cu12 11.0.2.54 nvidia-curand-cu12 10.3.2.106 nvidia-cusolver-cu12 11.4.5.107 nvidia-cusparse-cu12 12.1.0.106 nvidia-nccl-cu12 2.20.5 nvidia-nvjitlink-cu12 12.4.99 nvidia-nvtx-cu12 12.1.105 omegaconf 2.3.0 onnx 1.15.0 onnxruntime-gpu 1.17.1 opencv-python 4.9.0.80 opencv-python-headless 4.9.0.80 packaging 24.0 parso 0.8.3 pexpect 4.9.0 pillow 10.2.0 pip 24.0 prettytable 3.10.0 prompt-toolkit 3.0.43 protobuf 5.26.0 psutil 5.9.8 ptyprocess 0.7.0 pure-eval 0.2.2 pyfacer 0.0.4 Pygments 2.17.2 pyparsing 3.1.2 python-dateutil 2.9.0.post0 PyYAML 6.0.1 redis 5.0.3 regex 2023.12.25 requests 2.31.0 safetensors 0.4.2 scikit-build 0.17.6 scikit-image 0.22.0 scikit-learn 1.4.1.post1 scipy 1.12.0 setuptools 59.6.0 six 1.16.0 stack-data 0.6.3 sympy 1.12 tabulate 0.9.0 threadpoolctl 3.4.0 tifffile 2024.2.12 timm 0.9.16 tokenizers 0.19.1 tomesd 0.1.3 tomli 2.0.1 torch 2.3.1 torchvision 0.18.1 tqdm 4.66.2 traitlets 5.14.2 transformers 4.42.4 triton 2.3.1 typing_extensions 4.10.0 tzdata 2024.1 urllib3 2.2.1 validators 0.23.2 vine 5.1.0 wcwidth 0.2.13 wheel 0.37.1 widgetsnbextension 4.0.10 zipp 3.18.1

Who can help?

No response

tolgacangoz commented 1 month ago

For benchmarking purposes, I would prefer PyTorch's own profiler and benchmarking tools. To be certain, could you confirm your results with PyTorch's tools?
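A minimal sketch of what such a measurement could look like with `torch.utils.benchmark.Timer`, which handles warm-up and CUDA synchronization. The `run_inference` function here is a small stand-in workload so the snippet runs anywhere; in the real test it would be replaced by the actual pipeline call from the report.

```python
import torch
import torch.utils.benchmark as benchmark


def run_inference(x: torch.Tensor) -> torch.Tensor:
    # Stand-in workload (assumption); replace the body with the real
    # pipeline call, e.g. pipe(prompt, image=..., control_image=...).
    return x @ x


x = torch.randn(64, 64)

timer = benchmark.Timer(
    stmt="run_inference(x)",
    globals={"run_inference": run_inference, "x": x},
    label="controlnet img2img",
    description="stand-in workload",
)

# Run the statement 10 times and collect timing statistics.
measurement = timer.timeit(10)
print(measurement)
print(f"mean: {measurement.mean * 1e3:.3f} ms per call")
```

Running the same script once per diffusers version, with everything else unchanged, gives two comparable mean times.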

junnikokuki commented 1 month ago

Using torch.autograd.profiler and torch.backends.cudnn.benchmark = True?

junnikokuki commented 1 month ago

diffusers-profiler.zip

junnikokuki commented 1 month ago

Profiler JSON and key_averages output uploaded.
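For readers who want to produce comparable output, here is a rough sketch of how a trace JSON and a key_averages table can be generated with `torch.profiler`. It is not the attached script: the matrix multiplication is a stand-in workload and the output path is an assumption.

```python
import os
import tempfile

import torch
from torch.profiler import ProfilerActivity, profile

# Profile CPU always, and CUDA as well when a GPU is present.
activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(128, 128)

with profile(activities=activities) as prof:
    for _ in range(5):
        y = x @ x  # stand-in (assumption); replace with the pipeline call

# Aggregated per-operator statistics, sorted by total CPU time.
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(summary)

# Chrome trace that can be opened in chrome://tracing or Perfetto.
trace_path = os.path.join(tempfile.gettempdir(), "diffusers_trace.json")
prof.export_chrome_trace(trace_path)
```

Comparing the top operators of the two versions side by side is usually the quickest way to see where the extra time goes.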

tolgacangoz commented 1 month ago

Thanks! Could you also share the exact code you ran (for full reproducibility)?

junnikokuki commented 1 month ago

profiler.py.txt

This is a runnable Python script. To run it, install controlnet_aux via pip and prepare lllyasviel_control_v11p_sd15_openpose and dreamshaperV8.safetensors.

yiyixuxu commented 1 month ago

@junnikokuki

Can you try this script to see if you still see the performance loss? It is much simpler than the one you provided, so it will help us narrow down the cause.

from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch

import cv2
from PIL import Image

# download an image
image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
np_image = np.array(image)

# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)

# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)

# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

# generate image
generator = torch.manual_seed(0)
image = pipe(
    "futuristic-looking woman",
    num_inference_steps=20,
    generator=generator,
    image=image,
    control_image=canny_image,
).images[0]

junnikokuki commented 1 month ago

This script shows only a slight performance loss (1.2 vs 1.3 CUDA), so you can consider that it does not trigger the problem.
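Since the simplified script does not reproduce the slowdown, one possible way to narrow it down is to start from that script and reintroduce the settings from my original setup one at a time. The sketch below only builds the list of cases; the factor names and the `one_at_a_time` helper are assumptions for illustration, with values taken from the two scripts in this thread.

```python
# Factors that differ between the simplified script (first value)
# and the original reproduction (second value).
FACTORS = {
    "tomesd_patch": [False, True],
    "checkpoint": ["runwayml/stable-diffusion-v1-5", "dreamshaperV8.safetensors"],
    "controlnet": [
        "lllyasviel/sd-controlnet-canny",
        "lllyasviel_control_v11p_sd15_openpose",
    ],
}


def one_at_a_time(factors: dict[str, list]) -> list[dict]:
    """Return the baseline case plus one case per single changed factor."""
    baseline = {name: values[0] for name, values in factors.items()}
    cases = [dict(baseline)]
    for name, values in factors.items():
        for value in values[1:]:
            case = dict(baseline)
            case[name] = value  # change exactly one factor
            cases.append(case)
    return cases


cases = one_at_a_time(FACTORS)
for case in cases:
    print(case)
```

Each case would then be timed on both diffusers versions; the first case where the gap reappears points at the responsible setting.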

junnikokuki commented 1 month ago

@yiyixuxu What should I test next?
