Closed strint closed 1 year ago
@MirrorCY
You could update _cost_cnt; it now also reports host memory usage:
def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ", before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ", before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ", after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn
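The same before/after measuring pattern can be tried without OneFlow installed. This is only a minimal sketch: `cost_cnt_with` and `fake_probe` are hypothetical names introduced for illustration, and the probe is a stand-in for `GetCUDAMemoryUsed` / `GetCPUMemoryUsed` from the decorator above.

```python
import functools
import time


def cost_cnt_with(probe, label="mem"):
    """Build a decorator that reports run time and the change in probe()."""

    def decorator(fn):
        @functools.wraps(fn)  # keep fn.__name__ intact on the wrapper
        def new_fn(*args, **kwargs):
            before = probe()
            start_time = time.perf_counter()
            out = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start_time
            after = probe()
            print(fn.__name__, "run time", elapsed, "seconds")
            print(fn.__name__, label, "diff", after - before, "MB")
            return out

        return new_fn

    return decorator


# Stand-in probe (assumption): each call reports 100 MB more than the last.
_readings = iter(range(0, 1000, 100))


def fake_probe():
    return next(_readings)


@cost_cnt_with(fake_probe, label="host mem")
def load_something():
    return "loaded"


result = load_something()
```

Passing the probe in as an argument keeps the timing logic independent of the framework, so the CUDA and host counters could be swapped in without touching the wrapper.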
Here is my log from the save stage:
==> Try to run graph save...
==> function get_pipe try to run...
get_pipe cuda mem before 652.0 MB
get_pipe host mem before 2163.0 MB
Fetching 12 files: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 18606.89it/s]
get_pipe run time 34.62615132331848 seconds
get_pipe cuda mem after 652.0 MB
get_pipe cuda mem diff 0.0 MB
get_pipe host mem after 8467.0 MB
get_pipe host mem diff 6304.0 MB
<== function get_pipe finish run.
==> function pipe_to_cuda try to run...
pipe_to_cuda cuda mem before 652.0 MB
pipe_to_cuda host mem before 8467.0 MB
pipe_to_cuda run time 1.7569012641906738 seconds
pipe_to_cuda cuda mem after 3770.0 MB
pipe_to_cuda cuda mem diff 3118.0 MB
pipe_to_cuda host mem after 9470.0 MB
pipe_to_cuda host mem diff 1003.0 MB
<== function pipe_to_cuda finish run.
==> function config_graph try to run...
config_graph cuda mem before 3770.0 MB
config_graph host mem before 9470.0 MB
config_graph run time 6.890296936035156e-05 seconds
config_graph cuda mem after 3770.0 MB
config_graph cuda mem diff 0.0 MB
config_graph host mem after 9470.0 MB
config_graph host mem diff 0.0 MB
<== function config_graph finish run.
sd init time 36.38862633705139 s.
==> function text_to_image try to run...
text_to_image cuda mem before 3770.0 MB
text_to_image host mem before 9470.0 MB
100%|█████| 50/50 [00:13<00:00, 3.62it/s]
W20230210 01:54:00.468230 291033 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image run time 15.075793743133545 seconds
text_to_image cuda mem after 7854.0 MB
text_to_image cuda mem diff 4084.0 MB
text_to_image host mem after 10003.0 MB
text_to_image host mem diff 533.0 MB
<== function text_to_image finish run.
==> function text_to_image try to run...
text_to_image cuda mem before 7854.0 MB
text_to_image host mem before 10003.0 MB
/home/xuxiaoyu/dev/oneflow/python/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
warnings.warn(
100%|█████| 50/50 [00:13<00:00, 3.82it/s]
text_to_image run time 19.789249181747437 seconds
text_to_image cuda mem after 8202.0 MB
text_to_image cuda mem diff 348.0 MB
text_to_image host mem after 7869.0 MB
text_to_image host mem diff -2134.0 MB
<== function text_to_image finish run.
====> diff 0.0013520131
st init and run time 71.27231931686401 s.
==> function save_pipe_sch try to run...
save_pipe_sch cuda mem before 8202.0 MB
save_pipe_sch host mem before 7869.0 MB
save_pipe_sch run time 6.8812150955200195 seconds
save_pipe_sch cuda mem after 8204.0 MB
save_pipe_sch cuda mem diff 2.0 MB
save_pipe_sch host mem after 9686.0 MB
save_pipe_sch host mem diff 1817.0 MB
<== function save_pipe_sch finish run.
==> function save_graph try to run...
save_graph cuda mem before 8204.0 MB
save_graph host mem before 9686.0 MB
save_graph run time 3.215217351913452 seconds
save_graph cuda mem after 8206.0 MB
save_graph cuda mem diff 2.0 MB
save_graph host mem after 9715.0 MB
save_graph host mem diff 29.0 MB
<== function save_graph finish run.
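To compare runs like the one above, the per-function memory deltas can be pulled out of the log text. A rough sketch, assuming the line format printed by `_cost_cnt`; `summarize_mem_diffs` is a hypothetical helper, and the sample lines are copied from the log above.

```python
import re

# Matches lines such as "get_pipe host mem diff 6304.0 MB".
# \s+ tolerates the extra spaces that print() inserts between arguments.
_DIFF_RE = re.compile(
    r"^(\w+)\s+(cuda|host) mem diff\s+(-?\d+(?:\.\d+)?)\s+MB$"
)


def summarize_mem_diffs(log_text):
    """Return a list of (function, kind, diff_mb) tuples in log order."""
    rows = []
    for line in log_text.splitlines():
        match = _DIFF_RE.match(line.strip())
        if match:
            name, kind, diff = match.groups()
            rows.append((name, kind, float(diff)))
    return rows


sample = """\
get_pipe cuda mem diff 0.0 MB
get_pipe host mem diff 6304.0 MB
text_to_image host mem diff -2134.0 MB
save_pipe_sch host mem diff 1817.0 MB
"""

rows = summarize_mem_diffs(sample)
# Net host-memory change across the sampled steps.
host_total = sum(diff for _, kind, diff in rows if kind == "host")
```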
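Good morning! Getting up to test this right now.
It looks like there is a limit on the amount of pinned system memory when CUDA is used under WSL:
That looks like exactly the problem. I'll dig through the related material and look for a workaround.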
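One way to check for such a limit is to search for the largest pinned allocation that still succeeds. The sketch below is an assumption-laden illustration, not the test used in this thread: `largest_successful_size` and `try_pin_with_torch` are hypothetical names, PyTorch stands in for the framework under test, and the search assumes that if N MB fails, anything larger fails too.

```python
def largest_successful_size(try_alloc, lo_mb, hi_mb):
    """Largest size in [lo_mb, hi_mb] for which try_alloc(size) is True.

    Assumes success is monotonic in size. Returns None if even lo_mb fails.
    """
    if not try_alloc(lo_mb):
        return None
    while lo_mb < hi_mb:
        mid = (lo_mb + hi_mb + 1) // 2  # round up so the range always shrinks
        if try_alloc(mid):
            lo_mb = mid
        else:
            hi_mb = mid - 1
    return lo_mb


def try_pin_with_torch(size_mb):
    """Assumption: PyTorch is installed and used as a stand-in allocator."""
    import torch  # imported lazily so the search itself needs no GPU stack

    try:
        torch.empty(size_mb * 1024 * 1024, dtype=torch.uint8, pin_memory=True)
        return True
    except RuntimeError:
        return False


# Dry run with a fake allocator that "fails" above 3000 MB.
limit = largest_successful_size(lambda mb: mb <= 3000, 1, 16384)
```

On a real machine one would call `largest_successful_size(try_pin_with_torch, 1, 16384)`; the result is only indicative, since other processes and allocator caching can affect it.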
By the way, here is the new log:
ubuntu@DESKTOP-531RKJN:~$ python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped
==> Try to run graph save...
==> function get_pipe try to run...
get_pipe cuda mem before 1301.5 MB
get_pipe host mem before 1794.0 MB
Fetching 12 files: 100%|██████████████████████████████████████████████████████████████████████| 12/12 [00:00<00:00, 43203.13it/s]
get_pipe run time 15.790169954299927 seconds
get_pipe cuda mem after 1301.5 MB
get_pipe cuda mem diff 0.0 MB
get_pipe host mem after 7648.0 MB
get_pipe host mem diff 5854.0 MB
<== function get_pipe finish run.
==> function pipe_to_cuda try to run...
pipe_to_cuda cuda mem before 1301.5 MB
pipe_to_cuda host mem before 7648.0 MB
pipe_to_cuda run time 1.3050522804260254 seconds
pipe_to_cuda cuda mem after 4055.5 MB
pipe_to_cuda cuda mem diff 2754.0 MB
pipe_to_cuda host mem after 8612.0 MB
pipe_to_cuda host mem diff 964.0 MB
<== function pipe_to_cuda finish run.
==> function config_graph try to run...
config_graph cuda mem before 4055.5 MB
config_graph host mem before 8612.0 MB
config_graph run time 1.8358230590820312e-05 seconds
config_graph cuda mem after 4055.5 MB
config_graph cuda mem diff 0.0 MB
config_graph host mem after 8612.0 MB
config_graph host mem diff 0.0 MB
<== function config_graph finish run.
sd init time 17.098323345184326 s.
==> function text_to_image try to run...
text_to_image cuda mem before 4055.5 MB
text_to_image host mem before 8612.0 MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00, 5.19it/s]
W20230210 10:15:38.559587 32174 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image run time 10.316318035125732 seconds
text_to_image cuda mem after 8121.5 MB
text_to_image cuda mem diff 4066.0 MB
text_to_image host mem after 9057.0 MB
text_to_image host mem diff 445.0 MB
<== function text_to_image finish run.
==> function text_to_image try to run...
text_to_image cuda mem before 8121.5 MB
text_to_image host mem before 9057.0 MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:15<00:00, 3.17it/s]
text_to_image run time 26.04283618927002 seconds
text_to_image cuda mem after 9557.5 MB
text_to_image cuda mem diff 1436.0 MB
text_to_image host mem after 7308.0 MB
text_to_image host mem diff -1749.0 MB
<== function text_to_image finish run.
====> diff 0.0021405134
st init and run time 53.46802878379822 s.
==> function save_pipe_sch try to run...
save_pipe_sch cuda mem before 9557.5 MB
save_pipe_sch host mem before 7308.0 MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
what(): Error: out of memory
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp:209
OpCallInstructionUtil::Compute(this, instruction): copy:OpCall:s_d2h
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in Compute
OpCallInstructionUtil::Compute(this, instruction)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 41, in Compute
AllocateOutputBlobsMemory(op_call_instruction_policy, allocator, instruction)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 89, in AllocateOutputBlobsMemory
blob_object->TryAllocateBlobBodyMemory(allocator)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 100, in TryAllocateBlobBodyMemory
allocator->Allocate(&dptr, required_body_bytes)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
AllocateBlockToExtendTotalMem(aligned_size)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
backend_->Allocate(&mem_ptr, final_allocate_bytes)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/ep_backend_host_allocator.cpp", line 25, in Allocate
ep_device_->AllocPinned(allocation_options_, reinterpret_cast<void**>(mem_ptr), size)
Error Type: oneflow.ErrorProto.runtime_error
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/op_call_instruction_policy.cpp", line 209, in operator()
Error Type: oneflow.ErrorProto.runtime_error
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted
In this discussion, someone solved the WSL problem: https://github.com/huggingface/diffusers/issues/807#issuecomment-1278397383
Using native Ubuntu directly might be the simpler option : )
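Edition: Windows 10 Pro for Workstations; Version: 22H2; Installed on: 2022/12/16; OS build: 19045.2486; Experience: Windows Feature Experience Pack 120.2212.4190.0
It looks like I'm already on 22H2.
Using native Ubuntu directly might be the simpler option : )
Yes, that should solve the problem, but this is the only computer I have 😭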
wsl --update
It reports that it is already up to date. Should I try upgrading to Windows 11 or switching to native Ubuntu? I can help with testing if needed.
PS C:\Windows\system32> wsl --update
Checking for updates.
The most recent version of Windows Subsystem for Linux is already installed.
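One more thing I forgot to report: using a single OneFlowStableDiffusionPipeline directly is basically enough for production use, and it can fully use my 24 GB of VRAM.
https://github.com/huggingface/diffusers/issues/807#issuecomment-1335534315 OK, I ran this test and it failed, so the problem is almost certainly in WSL2.
@MirrorCY
The pinned-memory error is raised in the eager scheduler's save.
You could try this approach: it drops the save and load of the eager scheduler and the pipe, and keeps only the graph save/load: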
import time
import os
import gc
import shutil
import unittest
import tempfile
import numpy as np
import oneflow as flow
import oneflow as torch
from diffusers import (
    OneFlowStableDiffusionPipeline as StableDiffusionPipeline,
    OneFlowEulerDiscreteScheduler as EulerDiscreteScheduler,
)
from diffusers import utils
_model_id = "stabilityai/stable-diffusion-2"
_with_image_save = True
def _cost_cnt(fn):
    def new_fn(*args, **kwargs):
        print("==> function ", fn.__name__, " try to run...")
        flow._oneflow_internal.eager.Sync()
        before_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem before ", before_used, " MB")
        before_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem before ", before_host_used, " MB")
        start_time = time.time()
        out = fn(*args, **kwargs)
        flow._oneflow_internal.eager.Sync()
        end_time = time.time()
        print(fn.__name__, " run time ", end_time - start_time, " seconds")
        after_used = flow._oneflow_internal.GetCUDAMemoryUsed()
        print(fn.__name__, " cuda mem after ", after_used, " MB")
        print(fn.__name__, " cuda mem diff ", after_used - before_used, " MB")
        after_host_used = flow._oneflow_internal.GetCPUMemoryUsed()
        print(fn.__name__, " host mem after ", after_host_used, " MB")
        print(fn.__name__, " host mem diff ", after_host_used - before_host_used, " MB")
        print("<== function ", fn.__name__, " finish run.")
        print("")
        return out

    return new_fn
def _reset_session():
    # Close session to avoid the buffer name duplicate error.
    flow.framework.session_context.TryCloseDefaultSession()
    time.sleep(5)
    flow.framework.session_context.NewDefaultSession(flow._oneflow_global_unique_env)
def _test_sd_graph_save_and_load(is_save, graph_save_path, sch_file_path, pipe_file_path):
    if is_save:
        print("\n==> Try to run graph save...")
        _online_mode = False
        _pipe_from_file = False
    else:
        print("\n==> Try to run graph load...")
        _online_mode = True
        _pipe_from_file = False

    total_start_t = time.time()
    start_t = time.time()

    @_cost_cnt
    def get_pipe():
        if _pipe_from_file:
            scheduler = EulerDiscreteScheduler.from_pretrained(sch_file_path, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                pipe_file_path, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
            )
        else:
            scheduler = EulerDiscreteScheduler.from_pretrained(_model_id, subfolder="scheduler")
            sd_pipe = StableDiffusionPipeline.from_pretrained(
                _model_id, scheduler=scheduler, revision="fp16", torch_dtype=torch.float16
            )
        return scheduler, sd_pipe

    sch, pipe = get_pipe()

    @_cost_cnt
    def pipe_to_cuda():
        cu_pipe = pipe.to("cuda")
        return cu_pipe

    pipe = pipe_to_cuda()

    @_cost_cnt
    def config_graph():
        pipe.set_graph_compile_cache_size(9)
        pipe.enable_graph_share_mem()

    config_graph()

    if not _online_mode:
        pipe.enable_save_graph()
    else:
        @_cost_cnt
        def load_graph():
            assert (os.path.exists(graph_save_path) and os.path.isdir(graph_save_path))
            pipe.load_graph(graph_save_path, compile_unet=True, compile_vae=False)

        load_graph()

    end_t = time.time()
    print("sd init time ", end_t - start_t, 's.')

    @_cost_cnt
    def text_to_image(prompt, image_size, num_images_per_prompt=1, prefix="", with_graph=False):
        # image_size is either a single int (square image) or a (height, width) pair.
        if isinstance(image_size, int):
            image_height = image_size
            image_width = image_size
        elif isinstance(image_size, (tuple, list)):
            assert len(image_size) == 2
            image_height, image_width = image_size
        else:
            raise ValueError(f"invalid image_size {image_size}")

        # Fixed seed so the eager and graph runs can be compared.
        cur_generator = torch.Generator("cuda").manual_seed(1024)
        images = pipe(
            prompt,
            height=image_height,
            width=image_width,
            compile_unet=with_graph,
            compile_vae=False,
            num_images_per_prompt=num_images_per_prompt,
            generator=cur_generator,
            output_type="np",
        ).images

        if _with_image_save:
            for i, image in enumerate(images):
                pipe.numpy_to_pil(image)[0].save(
                    f"{prefix}{prompt}_{image_height}x{image_width}_{i}-with_graph_{str(with_graph)}.png"
                )
        return images

    prompt = "a photo of an astronaut riding a horse on mars"
    #sizes = [1024, 896, 768]
    sizes = [1024]
    for i in sizes:
        for j in sizes:
            no_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=False)
            with_g_images = text_to_image(prompt, (i, j), prefix=f"is_save_{str(is_save)}-", with_graph=True)
            assert len(no_g_images) == len(with_g_images)
            for img_idx in range(len(no_g_images)):
                print("====> diff ", np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean())
                assert np.abs(no_g_images[img_idx] - with_g_images[img_idx]).mean() < 1e-2

    total_end_t = time.time()
    print("st init and run time ", total_end_t - total_start_t, 's.')

    @_cost_cnt
    def save_pipe_sch():
        # Deliberately disabled: return early so the eager pipe and scheduler
        # are not saved, leaving only the graph save below.
        return
        pipe.save_pretrained(pipe_file_path)
        sch.save_pretrained(sch_file_path)

    @_cost_cnt
    def save_graph():
        assert os.path.exists(graph_save_path) and os.path.isdir(graph_save_path)
        pipe.save_graph(graph_save_path)

    if not _online_mode:
        save_pipe_sch()
        save_graph()
class OneFlowPipeLineGraphSaveLoadTests(unittest.TestCase):
    def tearDown(self):
        # clean up the VRAM after each test
        super().tearDown()
        gc.collect()
        torch.cuda.empty_cache()

    def test_sd_graph_save_and_load(self):
        with tempfile.TemporaryDirectory() as f0:
            with tempfile.TemporaryDirectory() as f1:
                with tempfile.TemporaryDirectory() as f2:
                    _test_sd_graph_save_and_load(True, f0, f1, f2)
                    _reset_session()
                    _test_sd_graph_save_and_load(False, f0, f1, f2)


if __name__ == "__main__":
    unittest.main()
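huggingface#807 (comment) OK, I ran this test and it failed, so the problem is almost certainly in WSL2.
Nice, that test does confirm the problem.
@MirrorCY The pinned-memory error is raised in the eager scheduler's save. You could try this approach: it drops the save and load of the eager scheduler and the pipe, and keeps only the graph save/load: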
python3 diffusers/tests/test_pipelines_oneflow_graph_load.py
libibverbs not available, ibv_fork_init skipped
==> Try to run graph save...
==> function get_pipe try to run...
get_pipe cuda mem before 1301.5 MB
get_pipe host mem before 1787.0 MB
Fetching 12 files: 100%|██████████████████████████| 12/12 [00:00<00:00, 74676.04it/s]
get_pipe run time 13.726568222045898 seconds
get_pipe cuda mem after 1301.5 MB
get_pipe cuda mem diff 0.0 MB
get_pipe host mem after 7384.0 MB
get_pipe host mem diff 5597.0 MB
<== function get_pipe finish run.
==> function pipe_to_cuda try to run...
pipe_to_cuda cuda mem before 1301.5 MB
pipe_to_cuda host mem before 7384.0 MB
pipe_to_cuda run time 0.8773036003112793 seconds
pipe_to_cuda cuda mem after 4061.5 MB
pipe_to_cuda cuda mem diff 2760.0 MB
pipe_to_cuda host mem after 8334.0 MB
pipe_to_cuda host mem diff 950.0 MB
<== function pipe_to_cuda finish run.
==> function config_graph try to run...
config_graph cuda mem before 4061.5 MB
config_graph host mem before 8334.0 MB
config_graph run time 2.8848648071289062e-05 seconds
config_graph cuda mem after 4061.5 MB
config_graph cuda mem diff 0.0 MB
config_graph host mem after 8334.0 MB
config_graph host mem diff 0.0 MB
<== function config_graph finish run.
sd init time 14.607499122619629 s.
==> function text_to_image try to run...
text_to_image cuda mem before 4061.5 MB
text_to_image host mem before 8334.0 MB
100%|████████████████████████████████████████████████| 50/50 [00:09<00:00, 5.35it/s]
W20230210 12:14:05.907799 19811 cudnn_conv_util.cpp:102] Currently available alogrithm (algo=1, require memory=7472256, idx=1) meeting requirments (max_workspace_size=1073741824, determinism=0) is not fastest. Fastest algorithm (1) requires memory 1074922512
text_to_image run time 10.030009269714355 seconds
text_to_image cuda mem after 8125.5 MB
text_to_image cuda mem diff 4064.0 MB
text_to_image host mem after 8788.0 MB
text_to_image host mem diff 454.0 MB
<== function text_to_image finish run.
==> function text_to_image try to run...
text_to_image cuda mem before 8125.5 MB
text_to_image host mem before 8788.0 MB
/home/ubuntu/.local/lib/python3.8/site-packages/oneflow/nn/modules/module.py:152: UserWarning: Interpolate() is called in a nn.Graph, but not registered into a nn.Graph.
warnings.warn(
100%|████████████████████████████████████████████████| 50/50 [00:17<00:00, 2.93it/s]
text_to_image run time 25.487446784973145 seconds
text_to_image cuda mem after 9561.5 MB
text_to_image cuda mem diff 1436.0 MB
text_to_image host mem after 7332.0 MB
text_to_image host mem diff -1456.0 MB
<== function text_to_image finish run.
====> diff 0.0023374576
st init and run time 50.13253879547119 s.
==> function save_pipe_sch try to run...
save_pipe_sch cuda mem before 9561.5 MB
save_pipe_sch host mem before 7332.0 MB
save_pipe_sch run time 1.1682510375976562e-05 seconds
save_pipe_sch cuda mem after 9561.5 MB
save_pipe_sch cuda mem diff 0.0 MB
save_pipe_sch host mem after 7332.0 MB
save_pipe_sch host mem diff 0.0 MB
<== function save_pipe_sch finish run.
==> function save_graph try to run...
save_graph cuda mem before 9561.5 MB
save_graph host mem before 7332.0 MB
terminate called after throwing an instance of 'oneflow::RuntimeException'
what(): Error: out of memory
You can set ONEFLOW_DEBUG or ONEFLOW_PYTHON_STACK_GETTER to 1 to get the Python stack of the error.
Aborted
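@MirrorCY Thanks. It looks like there is no way around it: both paths involve CUDA-to-CPU copies.
I think this issue can be shelved for now, since it is caused by an upstream problem.
OK, closing it for now then.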
The running environment is wsl2 Ubuntu 20.04, neither the host nor wsl2 is running any other CUDA programs.
Originally posted by @MirrorCY in https://github.com/Oneflow-Inc/diffusers/issues/75#issuecomment-1424482749