StableVideoDiffusionPipeline acceleration: runs successfully on V100, errors on A30

lss15151161 commented 2 weeks ago

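Environment: python 3.9, torch 2.1.0+cu121, onediff 1.2.0.dev202406150129, onediffx 1.2.0.dev202406150129, oneflow 0.9.1.dev20240615+cu121. Note: in tests on V100, peak GPU memory usage was 16 GB, which is below the A30's 24 GB, so the error should not be caused by running out of GPU memory.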

lijunliangTG commented 2 weeks ago

Could you describe how to reproduce this, and which model you are using?

lss15151161 commented 2 weeks ago

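What I have narrowed it down to so far: after the unet is optimized with onediff, its output on A100 is NaN, while on V100 the output is normal. Model: StableVideoDiffusionPipeline. Environment: diffusers=0.24.0. Steps to reproduce: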
import torch
from diffusers import StableVideoDiffusionPipeline
from onediffx import compile_pipe
from PIL import Image
import requests
from io import BytesIO

def load_pil_image(image_path, mode='RGB'):
    # Load an image from a URL or a local path; return the PIL image and its raw bytes.
    if image_path.startswith("http"):
        content = requests.get(image_path).content
        image_file = BytesIO(content)
    else:
        image_file = image_path
        with open(image_path, "rb") as f:
            content = f.read()
    image_pil = Image.open(image_file).convert(mode)
    return image_pil, content

class EndpointHandler(object):
    def __init__(self, debug=False):
        self.pipeline_rot = StableVideoDiffusionPipeline.from_pretrained(
            "./model/cm_rotation_v1",
            torch_dtype=torch.float16,
            variant="fp16")

        self.pipeline_rot.to("cuda")
        # Compile everything except the image encoder and the VAE with onediff.
        self.pipeline_rot = compile_pipe(self.pipeline_rot, ignores=["image_encoder", "vae"])
        self.debug = debug
        self.gen_size = (512, 512)  # (w, h)

    @torch.no_grad()
    def __call__(self, request, need_url=True, local_save=False):
        # Get the basic information
        img_url = "xxx"
        image_pil, image_bin = load_pil_image(img_url)

        seed = 43
        image_pil = image_pil.resize(self.gen_size)
        t_width, t_height = image_pil.size

        frames = self.pipeline_rot(image_pil,
                                   num_frames=14,
                                   decode_chunk_size=8,
                                   motion_bucket_id=64,
                                   height=t_height,
                                   width=t_width,
                                   generator=torch.manual_seed(seed),
                                   ).frames[0]
        return frames

if __name__ == '__main__':
    handler = EndpointHandler(debug=True)
    for i in range(2):
        # The request argument is not used in this repro, so pass None.
        resp = handler(None)

lijunliangTG commented 1 week ago

Which weights are you using? I could not find cm_rotation_v1 weights on HF.

lss15151161 commented 1 week ago

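Yes, those are weights I fine-tuned myself. However, the unoptimized model runs fine in fp16, and the optimized one has no problem on V100 either, so the weights should not be the cause. You can simply use the open-source weights on HF. So far I have traced the overflow to the UNetSpatioTemporalConditionModel model.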

lijunliangTG commented 1 week ago

With the stabilityai/stable-video-diffusion-img2vid-xt weights, it runs normally for me on a 3090, with peak GPU memory usage close to 24 GB.

torch version 2.3.0+cu121, python 3.10, onediff 1.2.0.dev1

lss15151161 commented 1 week ago

The versions should be fine. The problem only shows up on A100 and A30; tests on V100 are also normal.

lijunliangTG commented 1 week ago

It runs normally for me on A100. You could try reinstalling OneDiff and OneFlow: https://github.com/siliconflow/onediff/blob/main/README.md#optional-install-oneflow

lss15151161 commented 1 week ago

Is the generated result correct?

lss15151161 commented 1 week ago

It also runs to completion for me, but the generated video consists entirely of black frames.

lijunliangTG commented 1 week ago

I have run into this problem as well and am still investigating.

lss15151161 commented 1 week ago

The cause is probably that the unet output is all NaN. The odd part is that it is normal on V100. See this issue: https://github.com/siliconflow/onediff/issues/958
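For readers wondering how an fp16 overflow can end up as NaN, here is a minimal pure-Python sketch. It is only an illustration of the general mechanism and is not confirmed by this thread to be what happens inside the compiled unet. The helper names (`to_fp16`, `dot_fp16_accumulate`, `dot_wide_accumulate`) and the input values are hypothetical; half precision is emulated with the standard library's `struct` "e" format.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects finite values beyond the fp16 range (max 65504);
        # floating-point hardware would produce +/-inf instead.
        return math.copysign(math.inf, x)


def dot_fp16_accumulate(a, b):
    """Dot product whose running sum is rounded to fp16 after every step."""
    acc = 0.0
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


def dot_wide_accumulate(a, b):
    """Same fp16 inputs, but the running sum is kept in a wider float."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc


# Hypothetical values: each product (40000) fits in fp16, their sum does not.
q = [200.0] * 4
k = [200.0] * 4

narrow = dot_fp16_accumulate(q, k)   # overflows to inf
wide = dot_wide_accumulate(q, k)     # stays finite

# A softmax-style "subtract the row maximum" step turns inf into NaN.
shifted = narrow - narrow

print(narrow, wide, shifted)
```

In this sketch the individual products are representable in fp16, and only the accumulated sum overflows, which is why the accumulator precision matters independently of the fp16 weights and inputs.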

lss15151161 commented 1 week ago

Hello, have you found the cause of the problem?

lijunliangTG commented 6 days ago

Hello, we have now reproduced it successfully and are still investigating.

lijunliangTG commented 4 days ago

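As a workaround, you can first try export ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION=False. With this set, the frames are generated normally; the exact cause is still being investigated.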
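If the script is launched from somewhere a shell `export` is inconvenient, the same variable can be set from Python. This is a minimal sketch: the helper name `apply_attention_workaround` is hypothetical, and it assumes the variable must be in the process environment before OneFlow is imported and the pipeline is compiled, which the thread does not state explicitly.

```python
import os

ENV_NAME = "ONEFLOW_ATTENTION_ALLOW_HALF_PRECISION_ACCUMULATION"


def apply_attention_workaround() -> str:
    """Set the variable named in the thread for the current process."""
    # Assumption: OneFlow reads this from the environment, so it is set
    # before any oneflow / onediffx import happens.
    os.environ[ENV_NAME] = "False"
    return os.environ[ENV_NAME]


apply_attention_workaround()

# Only after the variable is set would the repro script continue, e.g.:
# from onediffx import compile_pipe
# pipe = compile_pipe(pipe, ignores=["image_encoder", "vae"])
```

Calling the helper at the very top of the entry script, ahead of the other imports, keeps the ordering assumption easy to verify.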

siliconflow / onediff

StableVideoDiffusionPipeline acceleration: runs successfully on V100, errors on A30 #955