test_pipelines_oneflow_graph_load out of host memory error in WSL

strint commented 1 year ago

The environment is WSL2 with Ubuntu 20.04; neither the host nor WSL2 is running any other CUDA programs.

ubuntu@DESKTOP-531RKJN:~$ python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> get_pipe  try to run
get_pipe  cuda mem before  1301.5
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 56488.94it/s]
get_pipe  run time  15.074813842773438
get_pipe  cuda mem after  1301.5
get_pipe  cuda mem diff  0.0
<== get_pipe  finish run

==> pipe_to_cuda  try to run
pipe_to_cuda  cuda mem before  1301.5
pipe_to_cuda  run time  1.1066811084747314
pipe_to_cuda  cuda mem after  4061.5
pipe_to_cuda  cuda mem diff  2760.0
<== pipe_to_cuda  finish run

==> config_graph  try to run
config_graph  cuda mem before  4061.5
config_graph  run time  1.5735626220703125e-05
config_graph  cuda mem after  4061.5
config_graph  cuda mem diff  0.0
<== config_graph  finish run

sd init time  16.18261170387268 s.
==> text_to_image  try to run
text_to_image  cuda mem before  4061.5
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.53it/s]
W20230210 00:32:48.454388  8336 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  9.699114561080933
text_to_image  cuda mem after  8125.5
text_to_image  cuda mem diff  4064.0
<== text_to_image  finish run

==> text_to_image  try to run
text_to_image  cuda mem before  8125.5
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:15<00:00,  3.17it/s]
text_to_image  run time  23.669822216033936
text_to_image  cuda mem after  9561.5
text_to_image  cuda mem diff  1436.0
<== text_to_image  finish run

====> diff  0.0023254268
st init and run time  49.55777668952942 s.
==> save_pipe_sch  try to run
save_pipe_sch  cuda mem before  9561.5
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp:209
        OpCallInstructionUtil::Compute(this, instruction): copy:OpCall:s_d2h

  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in Compute
    OpCallInstructionUtil::Compute(this, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 41, in Compute
    AllocateOutputBlobsMemory(op_call_instruction_policy, allocator, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 89, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(allocator)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 100, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/ep_backend_host_allocator.cpp", line 25, in Allocate
    ep_device_->AllocPinned(allocation_options_, reinterpret_cast<void**>(mem_ptr), size)
Error Type: oneflow.ErrorProto.runtime_error
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in operator()

Error Type: oneflow.ErrorProto.runtime_error
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted
ubuntu@DESKTOP-531RKJN:~$

Originally posted by @MirrorCY in https://github.com/Oneflow-Inc/diffusers/issues/75#issuecomment-1424482749

strint commented 1 year ago

@MirrorCY

You can update _cost_cnt; it now also reports host memory statistics:

def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ",  before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ",  before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ",  after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn
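
For readers who want to try the same bookkeeping without OneFlow installed, here is a minimal sketch of the decorator above with the memory readers passed in as arguments. The names `cost_cnt`, `probes` and `sync` are hypothetical and not part of the test file; the printed format mirrors the original.

```python
import functools
import time


def cost_cnt(probes, sync=lambda: None):
    """Build a decorator that prints run time plus before/after/diff for each probe.

    `probes` maps a label (e.g. "host mem") to a zero-argument callable returning MB.
    `sync` is called before sampling starts and again right after the function returns.
    """
    def decorator(fn):
        @functools.wraps(fn)  # keep fn.__name__ on the wrapper
        def new_fn(*args, **kwargs):
            name = fn.__name__
            print("==> function ", name, " try to run...")
            sync()
            before = {label: probe() for label, probe in probes.items()}
            for label, value in before.items():
                print(name, f" {label} before ", value, " MB")
            start_time = time.time()
            out = fn(*args, **kwargs)
            sync()
            print(name, " run time ", time.time() - start_time, " seconds")
            for label, probe in probes.items():
                after = probe()
                print(name, f" {label} after ", after, " MB")
                print(name, f" {label} diff ", after - before[label], " MB")
            print("<== function ", name, " finish run.")
            print("")
            return out
        return new_fn
    return decorator
```

With OneFlow, the calls already used above would be plugged in as `probes={"cuda mem": flow._oneflow_internal.GetCUDAMemoryUsed, "host mem": flow._oneflow_internal.GetCPUMemoryUsed}` and `sync=flow._oneflow_internal.eager.Sync`.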

Here is my log from the save stage:

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  652.0  MB
get_pipe  host mem before  2163.0  MB
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18606.89it/s]
get_pipe  run time  34.62615132331848  seconds
get_pipe  cuda mem after  652.0  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  8467.0  MB
get_pipe  host mem diff  6304.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  652.0  MB
pipe_to_cuda  host mem before  8467.0  MB
pipe_to_cuda  run time  1.7569012641906738  seconds
pipe_to_cuda  cuda mem after  3770.0  MB
pipe_to_cuda  cuda mem diff  3118.0  MB
pipe_to_cuda  host mem after  9470.0  MB
pipe_to_cuda  host mem diff  1003.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  3770.0  MB
config_graph  host mem before  9470.0  MB
config_graph  run time  6.890296936035156e-05  seconds
config_graph  cuda mem after  3770.0  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  9470.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  36.38862633705139 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  3770.0  MB
text_to_image  host mem before  9470.0  MB
100%|█████| 50/50 [00:13<00:00,  3.62it/s]
W20230210 01:54:00.468230 291033 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  15.075793743133545  seconds
text_to_image  cuda mem after  7854.0  MB
text_to_image  cuda mem diff  4084.0  MB
text_to_image  host mem after  10003.0  MB
text_to_image  host mem diff  533.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  7854.0  MB
text_to_image  host mem before  10003.0  MB
/home/xuxiaoyu/dev/oneflow/python/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|█████| 50/50 [00:13<00:00,  3.82it/s]
text_to_image  run time  19.789249181747437  seconds
text_to_image  cuda mem after  8202.0  MB
text_to_image  cuda mem diff  348.0  MB
text_to_image  host mem after  7869.0  MB
text_to_image  host mem diff  -2134.0  MB
<== function  text_to_image  finish run.

====> diff  0.0013520131
st init and run time  71.27231931686401 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  8202.0  MB
save_pipe_sch  host mem before  7869.0  MB
save_pipe_sch  run time  6.8812150955200195  seconds
save_pipe_sch  cuda mem after  8204.0  MB
save_pipe_sch  cuda mem diff  2.0  MB
save_pipe_sch  host mem after  9686.0  MB
save_pipe_sch  host mem diff  1817.0  MB
<== function  save_pipe_sch  finish run.

==> function  save_graph  try to run...
save_graph  cuda mem before  8204.0  MB
save_graph  host mem before  9686.0  MB
save_graph  run time  3.215217351913452  seconds
save_graph  cuda mem after  8206.0  MB
save_graph  cuda mem diff  2.0  MB
save_graph  host mem after  9715.0  MB
save_graph  host mem diff  29.0  MB
<== function  save_graph  finish run.
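
Logs in this format are easier to compare across machines once they are tabulated. The following is a rough sketch, not part of the test file, that parses the `_cost_cnt` output into one record per call; `summarize` and `unfinished` are hypothetical helper names.

```python
import re

# Matches lines such as "get_pipe  host mem diff  6304.0  MB".
_LINE = re.compile(
    r"^(\w+)\s+(cuda|host) mem (before|after|diff)\s+(-?\d+(?:\.\d+)?)\s+MB\s*$"
)


def summarize(log_text):
    """Return one dict per decorated call, e.g. {"fn": ..., "host diff": ...}."""
    calls = []
    current = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("==> function"):
            # "==> function  get_pipe  try to run..." -> third token is the name.
            current = {"fn": line.split()[2]}
            calls.append(current)
            continue
        match = _LINE.match(line)
        if match and current is not None:
            _, kind, when, value = match.groups()
            current[f"{kind} {when}"] = float(value)
    return calls


def unfinished(calls):
    """Names of calls that printed 'before' values but never an 'after' value."""
    return [c["fn"] for c in calls if "host before" in c and "host after" not in c]
```

A call that shows up in `unfinished` printed its "before" lines but aborted before the "after" lines, which points at the stage where the process died.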

MirrorCY commented 1 year ago

Good morning! Getting up now to start testing.

strint commented 1 year ago

It looks like there is a limit on the size of pinned system memory when CUDA is used under WSL.
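
One way to measure such a limit is to search for the largest pinned allocation that still succeeds. The sketch below only implements the search; `largest_ok` and the `try_alloc` callback are hypothetical names, and the callback is an assumption. It could, for example, attempt a pinned allocation of the given size in a fresh process and report whether it succeeded.

```python
def largest_ok(try_alloc, upper_mb, step_mb=64):
    """Binary-search the largest size in MB (a multiple of step_mb) that try_alloc accepts.

    try_alloc(size_mb) must return True on success and False on failure.
    Assumes success is monotonic: if N MB works, any smaller size works too.
    """
    lo, hi = 0, upper_mb // step_mb  # search in units of step_mb
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if try_alloc(mid * step_mb):
            lo = mid
        else:
            hi = mid - 1
    return lo * step_mb
```

Running each attempt in a separate process keeps allocator caching from skewing the result, at the cost of a slower search.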

MirrorCY commented 1 year ago

That does look like the problem. I'll dig through the related material and look for a workaround.

By the way, here is the new log:

ubuntu@DESKTOP-531RKJN:~$ python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  1301.5  MB
get_pipe  host mem before  1794.0  MB
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 43203.13it/s]
get_pipe  run time  15.790169954299927  seconds
get_pipe  cuda mem after  1301.5  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  7648.0  MB
get_pipe  host mem diff  5854.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  1301.5  MB
pipe_to_cuda  host mem before  7648.0  MB
pipe_to_cuda  run time  1.3050522804260254  seconds
pipe_to_cuda  cuda mem after  4055.5  MB
pipe_to_cuda  cuda mem diff  2754.0  MB
pipe_to_cuda  host mem after  8612.0  MB
pipe_to_cuda  host mem diff  964.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  4055.5  MB
config_graph  host mem before  8612.0  MB
config_graph  run time  1.8358230590820312e-05  seconds
config_graph  cuda mem after  4055.5  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  8612.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  17.098323345184326 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  4055.5  MB
text_to_image  host mem before  8612.0  MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.19it/s]
W20230210 10:15:38.559587 32174 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  10.316318035125732  seconds
text_to_image  cuda mem after  8121.5  MB
text_to_image  cuda mem diff  4066.0  MB
text_to_image  host mem after  9057.0  MB
text_to_image  host mem diff  445.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  8121.5  MB
text_to_image  host mem before  9057.0  MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:15<00:00,  3.17it/s]
text_to_image  run time  26.04283618927002  seconds
text_to_image  cuda mem after  9557.5  MB
text_to_image  cuda mem diff  1436.0  MB
text_to_image  host mem after  7308.0  MB
text_to_image  host mem diff  -1749.0  MB
<== function  text_to_image  finish run.

====> diff  0.0021405134
st init and run time  53.46802878379822 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  9557.5  MB
save_pipe_sch  host mem before  7308.0  MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp:209
        OpCallInstructionUtil::Compute(this, instruction): copy:OpCall:s_d2h

  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in Compute
    OpCallInstructionUtil::Compute(this, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 41, in Compute
    AllocateOutputBlobsMemory(op_call_instruction_policy, allocator, instruction)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 89, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(allocator)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 100, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/ep_backend_host_allocator.cpp", line 25, in Allocate
    ep_device_->AllocPinned(allocation_options_, reinterpret_cast<void**>(mem_ptr), size)
Error Type: oneflow.ErrorProto.runtime_error
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in operator()

Error Type: oneflow.ErrorProto.runtime_error
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted

strint commented 1 year ago

In this discussion, someone solved the WSL problem: https://github.com/huggingface/diffusers/issues/807#issuecomment-1278397383

Using native Ubuntu directly may be the simpler option : )

MirrorCY commented 1 year ago

Edition: Windows 10 Pro for Workstations
Version: 22H2
Installed on: 2022/12/16
OS build: 19045.2486
Experience: Windows Feature Experience Pack 120.2212.4190.0

It looks like I am already on 22H2.

MirrorCY commented 1 year ago

Using native Ubuntu directly may be the simpler option : )

Yes, that should solve the problem, but this is the only computer I have 😭

strint commented 1 year ago

https://github.com/huggingface/diffusers/issues/807#issuecomment-1278300487

This author's description looks more complete.

MirrorCY commented 1 year ago

wsl --update reports that it is already up to date. Do I need to try upgrading to Windows 11, or try native Ubuntu? I can help with testing if needed.

PS C:\Windows\system32> wsl --update
Checking for updates.
The most recent version of Windows Subsystem for Linux is already installed.

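One more thing I forgot to report: using a single OneFlowStableDiffusionPipeline directly is basically enough for production use, and it can fully use my 24 GB of GPU memory.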

MirrorCY commented 1 year ago

https://github.com/huggingface/diffusers/issues/807#issuecomment-1335534315 OK, I ran this test and it failed, so the problem is almost certainly in WSL2.
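
Since the failure is specific to WSL2, a test script may want to detect that environment before attempting the save step. Here is a small sketch based on the common heuristic that WSL kernels report "microsoft" in `/proc/version`; the function names are hypothetical, and the heuristic is an assumption rather than an official API.

```python
def looks_like_wsl(version_text):
    """Heuristic: WSL kernels include "microsoft" in their version string."""
    return "microsoft" in version_text.lower()


def running_under_wsl(path="/proc/version"):
    """Read the kernel version string and apply the heuristic; False if unreadable."""
    try:
        with open(path, encoding="utf-8", errors="replace") as f:
            return looks_like_wsl(f.read())
    except OSError:
        # No /proc (e.g. not Linux) or no permission: treat as "not WSL".
        return False
```

A test could then be skipped with something like `unittest.skipIf(running_under_wsl(), "pinned memory limit under WSL2")`.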

strint commented 1 year ago

@MirrorCY

The pinned-memory error occurs in the eager scheduler's save.

You can try this strategy: it removes the save and load of the eager scheduler and pipe, and keeps only the graph save and load:

import time
import os
import gc
import shutil
import unittest
import tempfile

import numpy as np
import oneflow as flow
import oneflow as torch

from diffusers import (
    OneFlowStableDiffusionPipeline as StableDiffusionPipeline,
    OneFlowEulerDiscreteScheduler as EulerDiscreteScheduler,
)
from diffusers import utils

_model_id = "stabilityai/stable-diffusion-2"
_with_image_save = True

def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ",  before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ",  before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ",  after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn

def _reset_session():
    # Close session to avoid the buffer name duplicate error.
    flow.framework.session_context.TryCloseDefaultSession()
    time.sleep(5)
    flow.framework.session_context.NewDefaultSession(flow._oneflow_global_unique_env)

def _test_sd_graph_save_and_load(is_save, graph_save_path, sch_file_path, pipe_file_path):
    if is_save:
        print("\n==> Try to run graph save...")
        _online_mode = False
        _pipe_from_file = False
    else:
        print("\n==> Try to run graph load...")
        _online_mode = True
        _pipe_from_file = False

    total_start_t = time.time()
    start_t = time.time()
    @_cost_cnt
    def get_pipe():
        if _pipe_from_file:
            scheduler = EulerDiscreteScheduler.from_pretrained(sch_file_path, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                pipe_file_path, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
                )
        else:
            scheduler = EulerDiscreteScheduler.from_pretrained(_model_id, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                _model_id, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
                )
        return scheduler, sd_pipe
    sch, pipe = get_pipe()

    @_cost_cnt
    def pipe_to_cuda():
        cu_pipe = pipe.to("cuda")
        return cu_pipe
    pipe = pipe_to_cuda()

    @_cost_cnt
    def config_graph():
        pipe.set_graph_compile_cache_size(9)
        pipe.enable_graph_share_mem()
    config_graph()

    if not _online_mode:
        pipe.enable_save_graph()
    else:
        @_cost_cnt
        def load_graph():
            assert (os.path.exists(graph_save_path) and os.path.isdir(graph_save_path))
            pipe.load_graph(graph_save_path, compile_unet=True, compile_vae=False)
        load_graph()
    end_t = time.time()
    print("sd init time ", end_t - start_t, 's.')

    @_cost_cnt
    def text_to_image(prompt, image_size, num_images_per_prompt=1, prefix="", with_graph=False):
        # image_size is either a single int (square image) or a (height, width) pair.
        if isinstance(image_size, int):
            image_height = image_size
            image_width = image_size
        elif isinstance(image_size, (tuple, list)):
            assert len(image_size) == 2
            image_height, image_width = image_size
        else:
            raise ValueError(f"invalid image_size {image_size}")

        # Fixed seed so the eager and graph runs can be compared.
        cur_generator = torch.Generator("cuda").manual_seed(1024)
        images = pipe(
            prompt,
            height=image_height,
            width=image_width,
            compile_unet=with_graph,
            compile_vae=False,
            num_images_per_prompt=num_images_per_prompt,
            generator=cur_generator,
            output_type="np",
        ).images

        if _with_image_save:
            for i, image in enumerate(images):
                pipe.numpy_to_pil(image)[0].save(f"{prefix}{prompt}_{image_height}x{image_width}_{i}-with_graph_{str(with_graph)}.png")

        return images

    prompt = "a photo of an astronaut riding a horse on mars"

    #sizes = [1024, 896, 768]
    sizes = [1024]
    for i in sizes:
        for j in sizes:
            no_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=False)
            with_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=True)
            assert len(no_g_images) == len(with_g_images)
            for img_idx in range(len(no_g_images)):
                print("====> diff ", np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean())
                assert np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean() < 1e-2
    total_end_t = time.time()
    print("st init and run time ", total_end_t - total_start_t, 's.')

    @_cost_cnt
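    def save_pipe_sch():
        # Saving the eager pipe and scheduler is deliberately disabled here,
        # because it is where the pinned-memory error was raised.
        # Remove this early return to re-enable it.
        return
        pipe.save_pretrained(pipe_file_path)
        sch.save_pretrained(sch_file_path)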

    @_cost_cnt
    def save_graph():
        assert os.path.exists(graph_save_path) and os.path.isdir(graph_save_path)
        pipe.save_graph(graph_save_path)

    if not _online_mode:
        save_pipe_sch()
        save_graph()

class OneFlowPipeLineGraphSaveLoadTests(unittest.TestCase):
    def tearDown(self):
        # clean up the VRAM after each test
        super().tearDown()
        gc.collect()
        torch.cuda.empty_cache()

    def test_sd_graph_save_and_load(self):
        with tempfile.TemporaryDirectory() as f0:
            with tempfile.TemporaryDirectory() as f1:
                with tempfile.TemporaryDirectory() as f2:
                    _test_sd_graph_save_and_load(True, f0, f1, f2)
                    _reset_session()
                    _test_sd_graph_save_and_load(False, f0, f1, f2)

if __name__ == "__main__":
    unittest.main()

strint commented 1 year ago

huggingface#807 (comment) OK, I ran this test and it failed, so the problem is almost certainly in WSL2.

Nice, that test does confirm the problem.

MirrorCY commented 1 year ago

@MirrorCY The pinned-memory error occurs in the eager scheduler's save. You can try this strategy: it removes the save and load of the eager scheduler and pipe, and keeps only the graph save and load:

python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped

==> Try to run graph save...
==> function  get_pipe  try to run...
get_pipe  cuda mem before  1301.5  MB
get_pipe  host mem before  1787.0  MB
Fetching 12 files: 100%|██████████████████████████| 12/12 [00:00<00:00, 74676.04it/s]
get_pipe  run time  13.726568222045898  seconds
get_pipe  cuda mem after  1301.5  MB
get_pipe  cuda mem diff  0.0  MB
get_pipe  host mem after  7384.0  MB
get_pipe  host mem diff  5597.0  MB
<== function  get_pipe  finish run.

==> function  pipe_to_cuda  try to run...
pipe_to_cuda  cuda mem before  1301.5  MB
pipe_to_cuda  host mem before  7384.0  MB
pipe_to_cuda  run time  0.8773036003112793  seconds
pipe_to_cuda  cuda mem after  4061.5  MB
pipe_to_cuda  cuda mem diff  2760.0  MB
pipe_to_cuda  host mem after  8334.0  MB
pipe_to_cuda  host mem diff  950.0  MB
<== function  pipe_to_cuda  finish run.

==> function  config_graph  try to run...
config_graph  cuda mem before  4061.5  MB
config_graph  host mem before  8334.0  MB
config_graph  run time  2.8848648071289062e-05  seconds
config_graph  cuda mem after  4061.5  MB
config_graph  cuda mem diff  0.0  MB
config_graph  host mem after  8334.0  MB
config_graph  host mem diff  0.0  MB
<== function  config_graph  finish run.

sd init time  14.607499122619629 s.
==> function  text_to_image  try to run...
text_to_image  cuda mem before  4061.5  MB
text_to_image  host mem before  8334.0  MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00,  5.35it/s]
W20230210 12:14:05.907799 19811 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image  run time  10.030009269714355  seconds
text_to_image  cuda mem after  8125.5  MB
text_to_image  cuda mem diff  4064.0  MB
text_to_image  host mem after  8788.0  MB
text_to_image  host mem diff  454.0  MB
<== function  text_to_image  finish run.

==> function  text_to_image  try to run...
text_to_image  cuda mem before  8125.5  MB
text_to_image  host mem before  8788.0  MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
  warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:17<00:00,  2.93it/s]
text_to_image  run time  25.487446784973145  seconds
text_to_image  cuda mem after  9561.5  MB
text_to_image  cuda mem diff  1436.0  MB
text_to_image  host mem after  7332.0  MB
text_to_image  host mem diff  -1456.0  MB
<== function  text_to_image  finish run.

====> diff  0.0023374576
st init and run time  50.13253879547119 s.
==> function  save_pipe_sch  try to run...
save_pipe_sch  cuda mem before  9561.5  MB
save_pipe_sch  host mem before  7332.0  MB
save_pipe_sch  run time  1.1682510375976562e-05  seconds
save_pipe_sch  cuda mem after  9561.5  MB
save_pipe_sch  cuda mem diff  0.0  MB
save_pipe_sch  host mem after  7332.0  MB
save_pipe_sch  host mem diff  0.0  MB
<== function  save_pipe_sch  finish run.

==> function  save_graph  try to run...
save_graph  cuda mem before  9561.5  MB
save_graph  host mem before  7332.0  MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
  what():  Error: out of memory
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted

strint commented 1 year ago

terminate called after throwing an instance of 'oneflow::RuntimeException' what(): Error: out of memory You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error. Aborted
@MirrorCY Thanks. It looks like there is no way around it: both paths involve a CUDA-to-CPU operation.

MirrorCY commented 1 year ago

I think this issue can be shelved for now; it is caused by an upstream problem~

strint commented 1 year ago

OK, closing it for now.

siliconflow / onediff

test_pipelines_oneflow_graph_load out of host memory error in WSL #95