jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

OOM: What are the memory requirements to run 768p quality? #175

Open AA-Developer opened 2 days ago

AA-Developer commented 2 days ago

I have 24 GB VRAM (RTX 3090), and I also use cpu_offloading = True.

[INFO] Starting text-to-video generation...
 19%|███████████████████████▊                                                                                                       | 3/16 [01:37<07:00, 32.34s/it]
[ERROR] Error during text-to-video generation: CUDA out of memory. Tried to allocate 9.07 GiB. GPU 0 has a total capacty of 23.68 GiB of which 7.80 GiB is free. Process 1566591 has 15.87 GiB memory in use. Of the allocated memory 5.67 GiB is allocated by PyTorch, and 9.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/queueing.py", line 536, in process_events
    response = await route_utils.call_process_api(
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/route_utils.py", line 322, in call_process_api
    output = await app.get_blocks().process_api(
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/blocks.py", line 1945, in process_api
    data = await self.postprocess_data(block_fn, result["prediction"], state)
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/blocks.py", line 1770, in postprocess_data
    outputs_cached = await processing_utils.async_move_files_to_cache(
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/processing_utils.py", line 485, in async_move_files_to_cache
    return await client_utils.async_traverse(
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio_client/utils.py", line 1003, in async_traverse
    new_obj[key] = await async_traverse(value, func, is_root)
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio_client/utils.py", line 999, in async_traverse
    return await func(json_obj)
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/processing_utils.py", line 447, in _move_to_cache
    elif utils.is_static_file(payload):
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/utils.py", line 1136, in is_static_file
    return _is_static_file(file_path, _StaticFiles.all_paths)
  File "/root/anaconda3/envs/pyramid/lib/python3.8/site-packages/gradio/utils.py", line 1149, in _is_static_file
    if not file_path.exists():
  File "/root/anaconda3/envs/pyramid/lib/python3.8/pathlib.py", line 1407, in exists
    self.stat()
  File "/root/anaconda3/envs/pyramid/lib/python3.8/pathlib.py", line 1198, in stat
    return self._accessor.stat(self)
OSError: [Errno 36] File name too long: 'Error during video generation: CUDA out of memory. Tried to allocate 9.07 GiB. GPU 0 has a total capacty of 23.68 GiB of which 7.80 GiB is free. Process 1566591 has 15.87 GiB memory in use. Of the allocated memory 5.67 GiB is allocated by PyTorch, and 9.89 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF'
[DEBUG] generate_text_to_video called.
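The allocator hint at the end of the error message itself can be tried before anything else. A minimal sketch, with an illustrative (untested) value for max_split_size_mb; the variable has to be set before torch initializes CUDA, e.g. at the very top of the launch script:

```python
import os

# Tune the CUDA caching allocator, as suggested by the OOM message.
# Must be set before the first CUDA allocation (i.e. before importing torch).
# The value below is illustrative, not a tested recommendation.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:256")

import torch  # imported after setting the env var on purpose
```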
feifeiobama commented 2 days ago

We have not extensively tested the new model with Gradio; could you try the Jupyter notebook instead?
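As a side note, the trailing OSError: [Errno 36] File name too long in the report above comes from the Gradio handler returning the error message string where a video file path is expected, so Gradio then tries to stat() it as a path. A minimal sketch of catching the OOM and surfacing it via gr.Error instead; the wrapper and the run_pyramid_flow_generation placeholder are hypothetical, not the repo's actual Gradio code:

```python
import gradio as gr
import torch


def generate_text_to_video_safe(prompt):
    # Hypothetical wrapper around the generation call used by the Gradio app.
    # On OOM, raise gr.Error so Gradio displays the message instead of
    # treating the returned string as a video file path.
    try:
        return run_pyramid_flow_generation(prompt)  # placeholder for the actual call
    except torch.cuda.OutOfMemoryError as e:
        torch.cuda.empty_cache()
        raise gr.Error(f"CUDA out of memory: {e}")
```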

AA-Developer commented 2 days ago

Yes, it is the Jupyter notebook; I am using it with Gradio.

cocktailpeanut commented 2 days ago

I have a 4090 and it's eating up the entire 24G VRAM.

In my case it doesn't crash, but obviously since it exceeds the upper bound it takes very long to generate with the 768p model (around 18 minutes). I get 116.60s/it.

For the record, the 384p model works just fine, using just 6~8G VRAM.

wx331406 commented 2 days ago


Why does it only take me a few minutes to run the SD3 model at 768p? I have a 3090 with 24 GB, and I haven't tried 768p with the FX model yet; it's hard to imagine the difference being that big.

el3ctr0de commented 2 days ago

Got the OOM error in the Gradio app on an L4 24 GB: 384p works, 768p gives the OOM error. Also got the error with the Jupyter script:

  6%|▋         | 1/16 [00:42<10:34, 42.32s/it]

OutOfMemoryError                          Traceback (most recent call last)
Cell In[6], line 15
     12 # Noting that, for the 384p version, only supports maximum 5s generation (temp = 16)
     14 with torch.no_grad(), torch.cuda.amp.autocast(enabled=True if model_dtype != 'fp32' else False, dtype=torch_dtype):
---> 15     frames = model.generate(
     16         prompt=prompt,
     17         num_inference_steps=[20, 20, 20],
     18         video_num_inference_steps=[10, 10, 10],
     19         height=height,
     20         width=width,
     21         temp=temp,
     22         guidance_scale=7.0,        # The guidance for the first frame, set it to 7 for 384p variant
     23         video_guidance_scale=5.0,  # The guidance for the other video latent
     24         output_type="pil",
     25         save_memory=True,          # If you have enough GPU memory, set it to False to improve vae decoding speed
     26     )
     28 export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
     29 show_video(None, "./text_to_video_sample.mp4", "70%")

File /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)
...
    365     )
    366     stage_hidden_states = stage_hidden_states.transpose(1, 2).flatten(2, 3)  # [bs, tot_seq, dim]
    368 output_encoder_hidden_list.append(stage_hidden_states[:, :encoder_length])

OutOfMemoryError: CUDA out of memory. Tried to allocate 6.81 GiB. GPU 0 has a total capacty of 22.17 GiB of which 6.54 GiB is free. Process 655789 has 15.62 GiB memory in use. Of the allocated memory 14.83 GiB is allocated by PyTorch, and 576.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.
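For comparison, a lower-memory variant of the same call as a sketch only: the 640x384 resolution follows the 384p notes in the thread, temp=16 follows the notebook comment about the 384p 5-second limit, the remaining arguments are copied from the failing cell, and names such as model, prompt, model_dtype and torch_dtype are assumed to be set up exactly as in the notebook:

```python
import torch

# Assumed: `model`, `prompt`, `model_dtype`, `torch_dtype` are already set up
# as in the notebook. Resolution and temp are the 384p settings, which the
# comments above report fitting in roughly 6-8 GB of VRAM.
with torch.no_grad(), torch.cuda.amp.autocast(enabled=(model_dtype != 'fp32'), dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=384,                # assumed 384p height
        width=640,                 # assumed 384p width
        temp=16,                   # 384p supports at most 5 s (temp = 16)
        guidance_scale=7.0,
        video_guidance_scale=5.0,
        output_type="pil",
        save_memory=True,          # keep memory saving on; VAE decoding is slower
    )
```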

wx331406 commented 2 days ago


My 3090 with 24 GB can run it; what about your report of insufficient GPU memory? Although my 24 GB card can run it, I also get a system reboot due to insufficient GPU power.