siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.
https://github.com/siliconflow/onediff/wiki
Apache License 2.0

Acceleration fails with multiple resolutions and a large batch #744

Open lovejing0306 opened 4 months ago

lovejing0306 commented 4 months ago

Describe the bug

On an A10 GPU (24 GB of VRAM), accelerating inference across multiple resolutions while generating 2 images per resolution raises an error.

Your environment

diffusers==0.27.0
transformers==4.38.2
xformers==0.0.23.post1
peft==0.7.1

# For CN users
python3 -m pip install -U --pre oneflow -f https://oneflow-pro.oss-cn-beijing.aliyuncs.com/branch/community/cu121
python3 -m pip install --pre onediff

git clone https://github.com/siliconflow/onediff.git
cd onediff/onediff_diffusers_extensions && python3 -m pip install -e .

How To Reproduce

Steps to reproduce the behavior (code or script):

import time
from PIL import Image
import oneflow as flow
import torch

from onediff.infer_compiler import oneflow_compile
from diffusers import LCMScheduler
from third_party.diffusers_mc.pipeline_stable_diffusion_xl_img2img import StableDiffusionXLImg2ImgPipeline

model_dir = 'ckpts/playground-v2'

# Model load and compile
pipeline = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    model_dir,
    torch_dtype=torch.float16,
    variant="fp16",
    use_safetensors=True,
)
pipeline.scheduler = LCMScheduler.from_config(pipeline.scheduler.config)
pipeline.safety_checker = None
pipeline.to('cuda', torch_dtype=torch.float16)

pipeline.unet = oneflow_compile(pipeline.unet)
pipeline.vae.decoder = oneflow_compile(pipeline.vae.decoder)

prompt = "a photo of an astronaut riding a horse on mars"

# Warm-up
warmup_sizes = [(1024, 1024)]
for size in warmup_sizes:
    _ = pipeline(prompt=prompt, height=size[0], width=size[1])

# Normal inference
inference_sizes = [(1024, 1024), (512, 2048), (2048, 512)]
for size in inference_sizes:
    start_time = time.time()
    image = pipeline(
        prompt=prompt,
        height=size[0],
        width=size[1],
        num_inference_steps=4,
        num_images_per_prompt=2,
        strength=1.0,
    ).images[0]
    end_time = time.time()
    print('time:', end_time-start_time)

The complete error message

Stack trace (most recent call last) in thread 1644:
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc73a15f1f, in
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bc3f9a7, in
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bc3f21c, in
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bc3aa98, in vm::ThreadCtx::TryReceiveAndRun()
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bbdd234, in vm::EpStreamPolicyBase::Run(vm::Instruction*) const
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bbe0537, in vm::Instruction::Compute()
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bbe7918, in vm::OpCallInstructionPolicy::Compute(vm::Instruction*)
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bbe75e9, in
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc6bbe273a, in
   Object "/opt/conda/lib/python3.10/site-packages/oneflow/../oneflow.libs/liboneflow-74324398.so", at 0x7fbc633e3d3c, in

Aborted (Signal sent by tkill() 1536 0)
Aborted (core dumped)

Additional context

However, without onediff acceleration, the same script runs fine with a batch size of 2.

strint commented 4 months ago

Watch the GPU memory usage while the script runs; this may be an OOM.
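One way to do this check is to sample `nvidia-smi` before and after each pipeline call. The sketch below is only an illustration: the helper names (`parse_gpu_memory`, `query_gpu_memory`, `format_reading`) are made up for this example, and it assumes `nvidia-smi` is on the PATH and reports numeric values. It reads device-level usage rather than a framework's own counters, so it should cover memory held by both OneFlow and PyTorch.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' lines (MiB) from nvidia-smi CSV output into tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], one entry per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.total",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory(result.stdout)


def format_reading(tag, readings):
    """Build a one-line summary such as 'after 1024x1024: GPU0 1024/23028 MiB'."""
    parts = [
        f"GPU{index} {used}/{total} MiB"
        for index, (used, total) in enumerate(readings)
    ]
    return f"{tag}: " + ", ".join(parts)
```

In the reproduction script, calling `print(format_reading(f"after {size}", query_gpu_memory()))` inside the inference loop would show whether usage approaches the 24 GB limit before the abort.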

lovejing0306 commented 4 months ago

I do see that GPU memory usage goes up when onediff is enabled. Is there any way to optimize this?

strint commented 4 months ago

See here: https://github.com/siliconflow/onediff/issues/605#issuecomment-1980574638

The extra usage comes from the GPU memory pool not being shared with torch, and the current version has no good way to handle that. We plan to fix this in the next major release, but it will take some time.

lovejing0306 commented 4 months ago

OK, I'll wait for the next release then. Thanks a lot.

strint commented 4 days ago

https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions/examples/sd3

@lovejing0306 Please try nexfort by following this example. There the GPU memory pool is shared with torch, so it adds almost no extra GPU memory.