siliconflow / onediff

OneDiff: An out-of-the-box acceleration library for diffusion models.
https://github.com/siliconflow/onediff/wiki
Apache License 2.0

test_pipelines_oneflow_graph_load out of host memory error in WSL #95

Closed. strint closed this issue 1 year ago

strint commented 1 year ago

The running environment is WSL2 Ubuntu 20.04; neither the host nor WSL2 is running any other CUDA programs.

ubuntu@DESKTOP-531RKJN:~$ python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> get_pipe  try to run
get_pipe  cuda mem before  1301.5
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 56488.94it/s]
get_pipe  run time  15.074813842773438
get_pipe  cuda mem after  1301.5
get_pipe  cuda mem diff  0.0
<== get_pipe  finish run

==> pipe_to_cuda  try to run
pipe_to_cuda  cuda mem before  1301.5
pipe_to_cuda  run time  1.1066811084747314
pipe_to_cuda  cuda mem after  4061.5
pipe_to_cuda  cuda mem diff  2760.0
<== pipe_to_cuda  finish run

==> config_graph  try to run
config_graph  cuda mem before  4061.5
config_graph  run time  1.5735626220703125e-05
config_graph  cuda mem after  4061.5
config_graph  cuda mem diff  0.0
<== config_graph  finish run

sd init time  16.18261170387268 s.
==> text_to_image  try to run
text_to_image  cuda mem before  4061.5
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.53it/s]
W20230210 00:32:48.454388  8336 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  9.699114561080933
text_to_image  cuda mem after  8125.5
text_to_image  cuda mem diff  4064.0
<== text_to_image  finish run

==> text_to_image  try to run
text_to_image  cuda mem before  8125.5
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:15<00:00,  3.17it/s]
text_to_image  run time  23.669822216033936
text_to_image  cuda mem after  9561.5
text_to_image  cuda mem diff  1436.0
<== text_to_image  finish run

====> diff  0.0023254268
st init and run time  49.55777668952942 s.
==> save_pipe_sch  try to run
save_pipe_sch  cuda mem before  9561.5
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp:209
        OpCallInstructionUtil::Compute(this, instruction): copy:OpCall:s_d2h

  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in Compute
    OpCallInstructionUtil::Compute(this, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 41, in Compute
    AllocateOutputBlobsMemory(op_call_instruction_policy, allocator, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 89, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(allocator)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 100, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/ep_backend_host_allocator.cpp", line 25, in Allocate
    ep_device_->AllocPinned(allocation_options_, reinterpret_cast<void**>(mem_ptr), size)
Error Type: oneflow.ErrorProto.runtime_error
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in operator()

Error Type: oneflow.ErrorProto.runtime_error
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted
ubuntu@DESKTOP-531RKJN:~$ 

Originally posted by @MirrorCY in https://github.com/Oneflow-Inc/diffusers/issues/75#issuecomment-1424482749
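
As the final log line suggests, ONEFLOW_PYTHON_STACK_GETTER can be enabled to recover the Python stack of the error. A minimal sketch (assuming, per the message, that the variable is read when oneflow initializes) sets it before the import:

import os

# Must be set before oneflow is imported so the runtime sees it.
os.environ["ONEFLOW_PYTHON_STACK_GETTER"] = "1"

import oneflow as flow  # later failures should now report the Python stack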

strint commented 1 year ago

@MirrorCY

You can update _cost_cnt with the version below, which adds host memory statistics:

import time

import oneflow as flow

# Report wall time plus the CUDA and host memory deltas around fn.
def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ",  before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ",  before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ",  after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn
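
For reference, any stage function can be instrumented the same way; load_weights below is a hypothetical name:

@_cost_cnt
def load_weights():
    # the stage being measured goes here
    pass

load_weights()  # prints run time plus cuda/host mem before, after, and diff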

Here is my log from the save stage:

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  652.0  MB
get_pipe  host mem before  2163.0  MB
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18606.89it/s]
get_pipe  run time  34.62615132331848  seconds
get_pipe  cuda mem after  652.0  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  8467.0  MB
get_pipe  host mem diff  6304.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  652.0  MB
pipe_to_cuda  host mem before  8467.0  MB
pipe_to_cuda  run time  1.7569012641906738  seconds
pipe_to_cuda  cuda mem after  3770.0  MB
pipe_to_cuda  cuda mem diff  3118.0  MB
pipe_to_cuda  host mem after  9470.0  MB
pipe_to_cuda  host mem diff  1003.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  3770.0  MB
config_graph  host mem before  9470.0  MB
config_graph  run time  6.890296936035156e-05  seconds
config_graph  cuda mem after  3770.0  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  9470.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  36.38862633705139 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  3770.0  MB
text_to_image  host mem before  9470.0  MB
100%|█████| 50/50 [00:13<00:00,  3.62it/s]
W20230210 01:54:00.468230 291033 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  15.075793743133545  seconds
text_to_image  cuda mem after  7854.0  MB
text_to_image  cuda mem diff  4084.0  MB
text_to_image  host mem after  10003.0  MB
text_to_image  host mem diff  533.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  7854.0  MB
text_to_image  host mem before  10003.0  MB
/home/xuxiaoyu/dev/oneflow/python/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|█████| 50/50 [00:13<00:00,  3.82it/s]
text_to_image  run time  19.789249181747437  seconds
text_to_image  cuda mem after  8202.0  MB
text_to_image  cuda mem diff  348.0  MB
text_to_image  host mem after  7869.0  MB
text_to_image  host mem diff  -2134.0  MB
<== function  text_to_image  finish run.

====> diff  0.0013520131
st init and run time  71.27231931686401 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  8202.0  MB
save_pipe_sch  host mem before  7869.0  MB
save_pipe_sch  run time  6.8812150955200195  seconds
save_pipe_sch  cuda mem after  8204.0  MB
save_pipe_sch  cuda mem diff  2.0  MB
save_pipe_sch  host mem after  9686.0  MB
save_pipe_sch  host mem diff  1817.0  MB
<== function  save_pipe_sch  finish run.

==> function  save_graph  try to run...
save_graph  cuda mem before  8204.0  MB
save_graph  host mem before  9686.0  MB
save_graph  run time  3.215217351913452  seconds
save_graph  cuda mem after  8206.0  MB
save_graph  cuda mem diff  2.0  MB
save_graph  host mem after  9715.0  MB
save_graph  host mem diff  29.0  MB
<== function  save_graph  finish run.

MirrorCY commented 1 year ago

Good morning! I'll get up and start testing right away~

strint commented 1 year ago

It looks like pinned system memory is size-limited when CUDA is used under WSL.
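
One way to probe that limit directly is to ask the CUDA runtime for progressively larger pinned host buffers. A minimal sketch via ctypes (assuming a libcudart.so symlink is on the loader path; the probe sizes are arbitrary):

import ctypes

cudart = ctypes.CDLL("libcudart.so")  # assumption: CUDA runtime library is discoverable

def try_pin(mb):
    ptr = ctypes.c_void_p()
    # cudaHostAlloc(&ptr, nbytes, cudaHostAllocDefault=0) returns 0 (cudaSuccess) on success
    err = cudart.cudaHostAlloc(ctypes.byref(ptr), ctypes.c_size_t(mb * 1024 * 1024), 0)
    if err == 0:
        cudart.cudaFreeHost(ptr)
    return err == 0

for mb in (256, 512, 1024, 2048, 4096, 8192):
    print(f"pinned alloc {mb} MB:", "ok" if try_pin(mb) else "FAILED")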

MirrorCY commented 1 year ago

That looks like exactly the problem; I'll dig through the related material for a workaround.

By the way, here is the new log:

ubuntu@DESKTOP-531RKJN:~$ python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  1301.5  MB
get_pipe  host mem before  1794.0  MB
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 43203.13it/s]
get_pipe  run time  15.790169954299927  seconds
get_pipe  cuda mem after  1301.5  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  7648.0  MB
get_pipe  host mem diff  5854.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  1301.5  MB
pipe_to_cuda  host mem before  7648.0  MB
pipe_to_cuda  run time  1.3050522804260254  seconds
pipe_to_cuda  cuda mem after  4055.5  MB
pipe_to_cuda  cuda mem diff  2754.0  MB
pipe_to_cuda  host mem after  8612.0  MB
pipe_to_cuda  host mem diff  964.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  4055.5  MB
config_graph  host mem before  8612.0  MB
config_graph  run time  1.8358230590820312e-05  seconds
config_graph  cuda mem after  4055.5  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  8612.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  17.098323345184326 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  4055.5  MB
text_to_image  host mem before  8612.0  MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.19it/s]
W20230210 10:15:38.559587 32174 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  10.316318035125732  seconds
text_to_image  cuda mem after  8121.5  MB
text_to_image  cuda mem diff  4066.0  MB
text_to_image  host mem after  9057.0  MB
text_to_image  host mem diff  445.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  8121.5  MB
text_to_image  host mem before  9057.0  MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:15<00:00,  3.17it/s]
text_to_image  run time  26.04283618927002  seconds
text_to_image  cuda mem after  9557.5  MB
text_to_image  cuda mem diff  1436.0  MB
text_to_image  host mem after  7308.0  MB
text_to_image  host mem diff  -1749.0  MB
<== function  text_to_image  finish run.

====> diff  0.0021405134
st init and run time  53.46802878379822 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  9557.5  MB
save_pipe_sch  host mem before  7308.0  MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp:209
        OpCallInstructionUtil::Compute(this, instruction): copy:OpCall:s_d2h

  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in Compute
    OpCallInstructionUtil::Compute(this, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 41, in Compute
    AllocateOutputBlobsMemory(op_call_instruction_policy, allocator, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 89, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(allocator)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 100, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/ep_backend_host_allocator.cpp", line 25, in Allocate
    ep_device_->AllocPinned(allocation_options_, reinterpret_cast<void**>(mem_ptr), size)
Error Type: oneflow.ErrorProto.runtime_error
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in operator()

Error Type: oneflow.ErrorProto.runtime_error
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted

strint commented 1 year ago

In this discussion, someone solved the WSL problem: https://github.com/huggingface/diffusers/issues/807#issuecomment-1278397383

Using native Ubuntu directly might be a simpler approach : )

MirrorCY commented 1 year ago

Edition: Windows 10 Pro for Workstations. Version: 22H2. Installed on: 2022/12/16. OS build: 19045.2486. Experience: Windows Feature Experience Pack 120.2212.4190.0

It looks like I'm already on 22H2.

MirrorCY commented 1 year ago

Using native Ubuntu directly might be a simpler approach : )

Yes, that should solve this problem, but this is the only computer I have 😭

strint commented 1 year ago

https://github.com/huggingface/diffusers/issues/807#issuecomment-1278300487

This author's description looks more complete.

MirrorCY commented 1 year ago

wsl --update reports it is up to date. Do I need to try updating to Windows 11 or try native Ubuntu? I can help test if needed.

PS C:\Windows\system32> wsl --update
Checking for updates.
The most recent version of Windows Subsystem for Linux is already installed.

One more thing I forgot to report back: a single OneFlowStableDiffusionPipeline used directly can basically meet production needs; it can saturate my 24 GB of VRAM.

MirrorCY commented 1 year ago

https://github.com/huggingface/diffusers/issues/807#issuecomment-1335534315 OK, I ran this test and it failed, so the problem can pretty much be pinned down to WSL2.

strint commented 1 year ago

@MirrorCY

The pinned-memory error occurs during the save of the eager scheduler.

You can try the strategy below: it drops the save and load of the eager scheduler and the pipe, keeping only the graph save/load:

import time
import os
import gc
import unittest
import tempfile

import numpy as np
import oneflow as flow
import oneflow as torch

from diffusers import (
    OneFlowStableDiffusionPipeline as StableDiffusionPipeline,
    OneFlowEulerDiscreteScheduler as EulerDiscreteScheduler,
)

_model_id = "stabilityai/stable-diffusion-2"
_with_image_save = True

# Report wall time plus the CUDA and host memory deltas around fn.
def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ",  before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ",  before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ",  after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn

def _reset_session():
    # Close session to avoid the buffer name duplicate error.
    flow.framework.session_context.TryCloseDefaultSession()
    time.sleep(5)
    flow.framework.session_context.NewDefaultSession(flow._oneflow_global_unique_env)

def _test_sd_graph_save_and_load(is_save, graph_save_path, sch_file_path, pipe_file_path):
    if is_save:
        print("\n==> Try to run graph save...")
        _online_mode = False
        _pipe_from_file = False
    else:
        print("\n==> Try to run graph load...")
        _online_mode = True
        # The pipe is still built from the hub; only the graph is loaded from file.
        _pipe_from_file = False

    total_start_t = time.time()
    start_t = time.time()
    @_cost_cnt
    def get_pipe():
        if _pipe_from_file:
            scheduler = EulerDiscreteScheduler.from_pretrained(sch_file_path, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                pipe_file_path, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
                )
        else:
            scheduler = EulerDiscreteScheduler.from_pretrained(_model_id, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                _model_id, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
                )
        return scheduler, sd_pipe
    sch, pipe = get_pipe()

    @_cost_cnt
    def pipe_to_cuda():
        cu_pipe = pipe.to("cuda")
        return cu_pipe
    pipe = pipe_to_cuda()

    @_cost_cnt
    def config_graph():
        pipe.set_graph_compile_cache_size(9)
        pipe.enable_graph_share_mem()
    config_graph()

    if not _online_mode:
        pipe.enable_save_graph()
    else:
        @_cost_cnt
        def load_graph():
            assert (os.path.exists(graph_save_path) and os.path.isdir(graph_save_path))
            pipe.load_graph(graph_save_path, compile_unet=True, compile_vae=False)
        load_graph()
    end_t = time.time()
    print("sd init time ", end_t - start_t, 's.')

    @_cost_cnt
    def text_to_image(prompt, image_size, num_images_per_prompt=1, prefix="", with_graph=False):
        if isinstance(image_size, int):
            image_height = image_size
            image_width = image_size
        elif isinstance(image_size, (tuple, list)):
            assert len(image_size) == 2
            image_height, image_width = image_size
        else:
            raise ValueError(f"invalid image_size {image_size}")

        cur_generator = torch.Generator("cuda").manual_seed(1024)
        images = pipe(
            prompt,
            height=image_height,
            width=image_width,
            compile_unet=with_graph,
            compile_vae=False,
            num_images_per_prompt=num_images_per_prompt,
            generator=cur_generator,
            output_type="np",
        ).images

        if _with_image_save:
            for i, image in enumerate(images):
                pipe.numpy_to_pil(image)[0].save(f"{prefix}{prompt}_{image_height}x{image_width}_{i}-with_graph_{str(with_graph)}.png")

        return images

    prompt = "a photo of an astronaut riding a horse on mars"

    # sizes = [1024, 896, 768]
    sizes = [1024]
    for i in sizes:
        for j in sizes:
            no_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=False)
            with_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=True)
            assert len(no_g_images) == len(with_g_images)
            for img_idx in range(len(no_g_images)):
                print("====> diff ", np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean())
                assert np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean() < 1e-2
    total_end_t = time.time()
    print("st init and run time ", total_end_t - total_start_t, 's.')

    @_cost_cnt
    def save_pipe_sch():
        # Saving the pipe and scheduler is skipped: this save path triggers the
        # pinned host-memory (d2h) allocation that fails under WSL2.
        return
        pipe.save_pretrained(pipe_file_path)
        sch.save_pretrained(sch_file_path)

    @_cost_cnt
    def save_graph():
        assert os.path.exists(graph_save_path) and os.path.isdir(graph_save_path)
        pipe.save_graph(graph_save_path)

    if not _online_mode:
        save_pipe_sch()
        save_graph()

class OneFlowPipeLineGraphSaveLoadTests(unittest.TestCase):
    def tearDown(self):
        # clean up the VRAM after each test
        super().tearDown()
        gc.collect()
        torch.cuda.empty_cache()

    def test_sd_graph_save_and_load(self):
        with tempfile.TemporaryDirectory() as f0:
            with tempfile.TemporaryDirectory() as f1:
                with tempfile.TemporaryDirectory() as f2:
                    _test_sd_graph_save_and_load(True, f0, f1, f2)
                    _reset_session()
                    _test_sd_graph_save_and_load(False, f0, f1, f2)

if __name__ == "__main__":
    unittest.main()

strint commented 1 year ago

huggingface#807 (comment) OK, I ran this test and it failed, so the problem can pretty much be pinned down to WSL2.

Nice, that test does confirm the problem.

MirrorCY commented 1 year ago

@MirrorCY The pinned-memory error occurs during the save of the eager scheduler. You can try the strategy below: it drops the save and load of the eager scheduler and the pipe, keeping only the graph save/load:

python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  1301.5  MB
get_pipe  host mem before  1787.0  MB
Fetching 12 files: 100%|██████████████████████████| 12/12 [00:00<00:00, 74676.04it/s]
get_pipe  run time  13.726568222045898  seconds
get_pipe  cuda mem after  1301.5  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  7384.0  MB
get_pipe  host mem diff  5597.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  1301.5  MB
pipe_to_cuda  host mem before  7384.0  MB
pipe_to_cuda  run time  0.8773036003112793  seconds
pipe_to_cuda  cuda mem after  4061.5  MB
pipe_to_cuda  cuda mem diff  2760.0  MB
pipe_to_cuda  host mem after  8334.0  MB
pipe_to_cuda  host mem diff  950.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  4061.5  MB
config_graph  host mem before  8334.0  MB
config_graph  run time  2.8848648071289062e-05  seconds
config_graph  cuda mem after  4061.5  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  8334.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  14.607499122619629 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  4061.5  MB
text_to_image  host mem before  8334.0  MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.35it/s]
W20230210 12:14:05.907799 19811 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  10.030009269714355  seconds
text_to_image  cuda mem after  8125.5  MB
text_to_image  cuda mem diff  4064.0  MB
text_to_image  host mem after  8788.0  MB
text_to_image  host mem diff  454.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  8125.5  MB
text_to_image  host mem before  8788.0  MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.93it/s]
text_to_image  run time  25.487446784973145  seconds
text_to_image  cuda mem after  9561.5  MB
text_to_image  cuda mem diff  1436.0  MB
text_to_image  host mem after  7332.0  MB
text_to_image  host mem diff  -1456.0  MB
<== function  text_to_image  finish run.

====> diff  0.0023374576
st init and run time  50.13253879547119 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  9561.5  MB
save_pipe_sch  host mem before  7332.0  MB
save_pipe_sch  run time  1.1682510375976562e-05  seconds
save_pipe_sch  cuda mem after  9561.5  MB
save_pipe_sch  cuda mem diff  0.0  MB
save_pipe_sch  host mem after  7332.0  MB
save_pipe_sch  host mem diff  0.0  MB
<== function  save_pipe_sch  finish run.

==> function  save_graph  try to run...
save_graph  cuda mem before  9561.5  MB
save_graph  host mem before  7332.0  MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted

strint commented 1 year ago

terminate called after throwing an instance of 'oneflow::RuntimeException' what(): Error: out of memory You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error. Aborted


@MirrorCY Thanks. It looks like there is no way around it; both save paths involve CUDA-to-CPU copies.
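
For context, both save paths stage tensors from device to host before serializing, and the traceback shows that the host buffer is allocated as pinned memory (AllocPinned in ep_backend_host_allocator.cpp). A minimal sketch of the pattern (tensor and path are illustrative):

import oneflow as flow

t = flow.randn(4, 4, device="cuda")  # parameter living in CUDA memory
host_t = t.to("cpu")                 # d2h copy; per the trace, staged through pinned host memory
flow.save({"w": host_t}, "/tmp/w")   # serialization then proceeds from the host copy
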
MirrorCY commented 1 year ago

I feel this issue can be set aside for now; it is caused by an upstream problem~

strint commented 1 year ago

OK, closing this for now.