SDXL ControlNet inpaint costs more memory than torch

285220927 commented 9 months ago

Describe the bug


When testing SDXL ControlNet inpaint, OneDiff's GPU memory usage is higher than PyTorch's by nearly half.

Your environment

OS: Ubuntu 20.04
GPU: NVIDIA GeForce RTX 4090
Python: 3.8
diffusers: 0.23.0
onediff: 0.12.1.dev202401310124
PyTorch: 2.0.1
CUDA: 12.2

OneDiff git commit id

OneFlow version info

Output of python -m oneflow --doctor:

version: 0.9.1.dev20240125+cu122
git_commit: 6458a12
cmake_build_type: Release
rdma: True
mlir: True
enterprise: False

How To Reproduce

Steps to reproduce the behavior(code or script):

import torch
from diffusers import StableDiffusionXLControlNetInpaintPipeline, ControlNetModel

from onediff.infer_compiler import oneflow_compile

# NOTE: controlnet_path, prompt, init_image, mask_image and control_image
# are not defined in this report; supply your own values before running.
device = "cuda:0"
controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetInpaintPipeline.from_pretrained(
    'stablediffusionapi/dreamshaper-xl',
    controlnet=controlnet,
    torch_dtype=torch.float16
)
pipe = pipe.to(device)

# Compile the UNet and ControlNet with OneDiff's OneFlow backend.
# (The PyTorch baseline presumably omits these two lines.)
pipe.unet = oneflow_compile(pipe.unet)
pipe.controlnet = oneflow_compile(pipe.controlnet)

# Run inpainting guided by the ControlNet conditioning image.
image = pipe(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    control_image=control_image,
    strength=1.0,
    controlnet_conditioning_scale=0.5,
    num_inference_steps=30
)

The complete result

pytorch

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:82:00.0 Off |                  Off |
|  0%   38C    P2              64W / 450W |  13380MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

onediff

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:82:00.0 Off |                  Off |
|  0%   36C    P8              22W / 450W |  18382MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
+---------------------------------------------------------------------------------------+
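For reference, the gap between the two readings above can be worked out directly. The snippet below is only a sketch: the helper name parse_memory_usage is hypothetical, and the two rows are copied from the nvidia-smi output in this report.

```python
import re


def parse_memory_usage(smi_line):
    """Return (used_mib, total_mib) from an nvidia-smi row containing 'NNNMiB / NNNMiB'."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_line)
    if match is None:
        raise ValueError("no 'usedMiB / totalMiB' field found")
    return int(match.group(1)), int(match.group(2))


# Rows copied from the two nvidia-smi dumps in this report.
pytorch_row = "|  0%   38C    P2              64W / 450W |  13380MiB / 24564MiB |      0%      Default |"
onediff_row = "|  0%   36C    P8              22W / 450W |  18382MiB / 24564MiB |      0%      Default |"

pytorch_used, _ = parse_memory_usage(pytorch_row)
onediff_used, _ = parse_memory_usage(onediff_row)

extra_mib = onediff_used - pytorch_used                   # absolute difference in MiB
overhead_pct = round(100 * extra_mib / pytorch_used, 1)   # relative to the PyTorch run
print(extra_mib, overhead_pct)  # 5002 37.4
```

With these numbers, OneDiff uses about 5002 MiB (roughly 37%) more than the PyTorch run. Note that nvidia-smi reports memory reserved by each framework's caching allocator, not only live tensors.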


strint commented 9 months ago

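You can use oneflow_compiler_config to turn off mlir_enable_inference_optimization. This disables constant folding and reduces part of the GPU memory overhead. (In earlier tests this cost about 5% in speed.)

Also, you can call oneflow.cuda.empty_cache() to clear the cache held by OneFlow's GPU memory pool.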
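A minimal sketch of these two suggestions follows. The import path onediff.infer_compiler.oneflow_compiler_config and the attribute-style assignment are assumptions based on the names mentioned above, so check them against your installed OneDiff version. The helper names are hypothetical, and the imports are guarded so the snippet does nothing harmful when a framework is missing.

```python
import importlib


def disable_inference_optimization():
    """Try to turn off MLIR inference optimization (constant folding).

    ASSUMPTION: oneflow_compiler_config is importable from
    onediff.infer_compiler and exposes this flag as an attribute.
    Call this before oneflow_compile(...).
    """
    try:
        from onediff.infer_compiler import oneflow_compiler_config
    except ImportError:
        return False  # OneDiff (or this symbol) is not available here
    oneflow_compiler_config.mlir_enable_inference_optimization = False
    return True


def empty_framework_caches(frameworks=("oneflow", "torch")):
    """Release cached GPU memory held by each installed framework's allocator."""
    cleared = []
    for name in frameworks:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # framework not installed in this environment
        if module.cuda.is_available():
            module.cuda.empty_cache()  # frees cached blocks, not live tensors
            cleared.append(name)
    return cleared
```

Calling empty_framework_caches() after a pipeline run lowers the figure shown by nvidia-smi, but it should not be expected to change the peak memory reached during inference.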

strint commented 9 months ago

For diffusers, you can try the recently added compile_pipe, which compiles a wider scope of the pipeline and runs faster.

https://github.com/siliconflow/onediff/tree/main/onediff_diffusers_extensions#compile_pipe

285220927 commented 9 months ago

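You can use oneflow_compiler_config to turn off mlir_enable_inference_optimization. This disables constant folding and reduces part of the GPU memory overhead. (In earlier tests this cost about 5% in speed.)

Also, you can call oneflow.cuda.empty_cache() to clear the cache held by OneFlow's GPU memory pool.

After making these changes, memory usage drops by about 700 MiB, but it is still much higher than with PyTorch. Is this expected?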

onefish51 commented 8 months ago

When using StableDiffusionControlNetPipeline from diffusers together with either compile_pipe from onediffx or oneflow_compile from onediff.infer_compiler, I noticed a significant increase in GPU memory usage, from 13 GB to 18 GB, with a peak of 21 GB.

strint commented 8 months ago

When using StableDiffusionControlNetPipeline from diffusers together with either compile_pipe from onediffx or oneflow_compile from onediff.infer_compiler, I noticed a significant increase in GPU memory usage, from 13 GB to 18 GB, with a peak of 21 GB.

What about running oneflow.cuda.empty_cache()?

Yes, the compiled VAE/ControlNet will take extra memory. We are trying to reduce this cost in a future version, but it will not be quick.

@onefish51

onefish51 commented 8 months ago

It is still useful. Previously, when I used torch.cuda.empty_cache(), it didn't work, but when I used oneflow.cuda.empty_cache(), it worked. However, the peak GPU memory usage during runtime remains the same.

And I am looking forward to "the future version".
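Since empty_cache() only affects what is held after a run, peak usage has to be measured while the pipeline is running. One way to do that, sketched below, is to poll nvidia-smi in a background thread. The class and function names are hypothetical; the --query-gpu and --format options are standard nvidia-smi flags. The reading is device-wide, so other processes on the same GPU are included.

```python
import subprocess
import threading
import time


def parse_used_mib(csv_output):
    """Parse 'memory.used' values (one per GPU, in MiB) from nvidia-smi CSV output."""
    return [int(line) for line in csv_output.splitlines() if line.strip()]


def query_used_mib():
    """Ask nvidia-smi for the memory currently used on every GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_used_mib(out)


class PeakMemoryMonitor:
    """Context manager that records the highest memory reading seen on one GPU."""

    def __init__(self, gpu_index=0, interval_s=0.1, query=query_used_mib):
        self.gpu_index = gpu_index
        self.interval_s = interval_s
        self.query = query
        self.peak_mib = 0
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def sample(self):
        # Take one reading and keep the maximum seen so far.
        self.peak_mib = max(self.peak_mib, self.query()[self.gpu_index])

    def _run(self):
        while not self._stop.is_set():
            self.sample()
            time.sleep(self.interval_s)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join()
        self.sample()  # one final reading after the workload finishes
```

Wrapping the pipeline call in `with PeakMemoryMonitor() as monitor:` and reading `monitor.peak_mib` afterwards gives a figure that can be compared between the PyTorch and OneDiff runs. Spikes shorter than the polling interval can be missed.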

strint commented 4 months ago

@onefish51 @285220927

Please refer to this to compile StableDiffusionXLControlNetInpaintPipeline with the nexfort backend.

This should use less CUDA memory than before.

import argparse
from PIL import Image

import torch

from onediffx import compile_pipe
from diffusers import StableDiffusionImg2ImgPipeline

prompt = "sea,beach,the waves crashed on the sand,blue sky whit white cloud"

def parse_args():
    parser = argparse.ArgumentParser(description="Simple demo of image generation.")
    parser.add_argument(
        "--model_id", type=str, default="stabilityai/stable-diffusion-2-1",
    )
    cmd_args = parser.parse_args()
    return cmd_args

args = parse_args()

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    args.model_id, use_auth_token=True, revision="fp16", torch_dtype=torch.float16,
)

pipe = pipe.to("cuda")

options = '{"mode": "max-optimize:max-autotune:low-precision", "memory_format": "channels_last"}'
pipe = compile_pipe(pipe, backend="nexfort", options=options, fuse_qkv_projections=True)

img = Image.new("RGB", (512, 512), "#1f80f0")

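# `flow` (oneflow) is never imported in this script; nexfort runs on PyTorch, so use torch.autocast.
with torch.autocast("cuda"):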
    images = pipe(
        prompt, image=img, guidance_scale=10, num_inference_steps=100, output_type="np",
    ).images
    for i, image in enumerate(images):
        pipe.numpy_to_pil(image)[0].save(f"{prompt}-of-{i}.png")
