siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.
https://github.com/siliconflow/onediff/wiki
Apache License 2.0
1.61k stars 99 forks

SDXL Model Inference Performance Issue #937

Closed pingren closed 2 months ago

pingren commented 3 months ago

Describe the bug

The initial steps of inferencing with the SDXL model are significantly slower after the model is loaded.

Your environment

OS

Debian Slim Linux

OneDiff git commit id

f6a7224cc4754300dfc698385d62b3b329e54742

OneFlow version info

Output of `python -m oneflow --doctor`:

libibverbs not available, ibv_fork_init skipped
path: ['/usr/local/lib/python3.10/site-packages/oneflow']
version: 0.9.1.dev20240515+cu121
git_commit: ec7b682
cmake_build_type: Release
rdma: True
mlir: True
enterprise: False

How To Reproduce

Steps to reproduce the behavior (code or script):

  1. Pass `--highvram --fp16-unet` in the ComfyUI CLI args

  2. Load `simple_SDXL_workflow.json`

  3. Switch between SDXL checkpoints before inference, e.g. https://civitai.com/models/133005?modelVersionId=456194 https://civitai.com/models/269232/aam-xl-anime-mix

  4. Click queue prompt

CheckpointLoaderSimple

got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
Requested to load SDXLClipModel
Loading 1 new model
100%|██████████| 25/25 [00:07<00:00,  3.36it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 13.45 seconds
got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
Requested to load SDXLClipModel
Loading 1 new model
100%|██████████| 25/25 [00:07<00:00,  3.44it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 11.45 seconds

OneDiffCheckpointLoaderSimple

got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
Requested to load SDXLClipModel
Loading 1 new model
 56%|█████▌    | 14/25 [00:05<00:02,  4.00it/s]
100%|██████████| 25/25 [00:08<00:00,  3.01it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 14.90 seconds
got prompt
model_type EPS
Using pytorch attention in VAE
Using pytorch attention in VAE
loaded straight to GPU
Requested to load SDXL
Loading 1 new model
Requested to load SDXLClipModel
Loading 1 new model
100%|██████████| 25/25 [00:07<00:00,  3.21it/s]
Requested to load AutoencoderKL
Loading 1 new model
Prompt executed in 13.96 seconds

The total run is 25 inference steps. At that length, the overhead of OneDiffCheckpointLoaderSimple makes it slower than CheckpointLoaderSimple; once the step count is increased to 100, OneDiffCheckpointLoaderSimple becomes faster.

The first few steps are significantly slower compared to CheckpointLoaderSimple.
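The trade-off described above can be sketched as a simple break-even calculation: a one-time optimization overhead is amortized over steps that each run slightly faster. The numbers below are illustrative assumptions, not measurements from this issue.

```python
def break_even_steps(overhead_s: float,
                     baseline_step_s: float,
                     compiled_step_s: float) -> float:
    """Number of steps after which the compiled path's total time
    undercuts the baseline, despite its one-time overhead."""
    if compiled_step_s >= baseline_step_s:
        raise ValueError("compiled path must be faster per step to break even")
    # overhead + n * compiled_step_s < n * baseline_step_s  =>  solve for n
    return overhead_s / (baseline_step_s - compiled_step_s)

# Assumed values: ~3 s one-time overhead, 0.29 s/step eager vs. 0.25 s/step compiled.
print(break_even_steps(3.0, 0.29, 0.25))
```

Under these assumed numbers the break-even point lands around 75 steps, which is consistent with the observation that OneDiff loses at 25 steps but wins at 100.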

[Screenshot: 2024-06-07 at 10:25:20]
strint commented 2 months ago

> The first few steps are significantly slower compared to CheckpointLoaderSimple.

This is expected: the first run spends roughly 10~60 seconds on compilation and optimization. After that, it runs at full speed.

strint commented 2 months ago

@pingren