dillfrescott opened this issue 1 month ago
Even using app.py, it hits step 14 and starts flooding into shared RAM, halting progress.
I'm unable to reproduce using your shared script and the latest commit. Make sure your repo is up to date, as the fix for this was merged yesterday. Disabling system memory fallback (I believe this is something you can do in the NVIDIA Control Panel, though I don't use Windows) will also fix this, as the GPU will recognize it's time to deallocate when it reaches the max.
Okay thank you!
I think that was my issue. I disabled the sysmem fallback and it seems to be helping.
Never mind. Now it says:
```
Traceback (most recent call last):
  File "text.py", line 23, in <module>
    frames = model.generate(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 703, in generate
    intermed_latents = self.generate_one_unit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 285, in generate_one_unit
    noise_pred = self.dit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_pyramid_mmdit.py", line 479, in forward
    encoder_hidden_states, hidden_states = block(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 640, in forward
    attn_output, context_attn_output = self.attn(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 548, in forward
    hidden_states, encoder_hidden_states = self.var_len_attn(
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 308, in __call__
    stage_hidden_states = F.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.68 GiB. GPU 0 has a total capacty of 23.99 GiB of which 12.48 GiB is free. Of the allocated memory 6.10 GiB is allocated by PyTorch, and 3.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
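The allocator hint at the end of that message can be tried without touching the model code. A minimal sketch, assuming the generation script is launched directly; the 512 MB value is only an illustration, not a recommendation from the repo:

```py
import os

# The caching-allocator config is read when CUDA is initialized, so set it
# before the first import of torch (or export it in the shell before launching).
# max_split_size_mb limits how large a cached block the allocator will split,
# which can help when a big allocation fails despite plenty of "free" memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # noqa: E402  (imported after setting the env var on purpose)

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"], torch.cuda.is_available())
```

This only mitigates fragmentation; it won't help if a single attention call genuinely needs 13+ GiB.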
I don't understand why it is still doing this.
Hmm. Do you have the temp, res, or something else set super high? It shouldn't be allocating 14 GB with the script you provided in the original post.
I tried app.py as well and it got to step 17 and crashed (OOM). I have not modified anything.
I set temp to 31 because I want a 10 second video, but everything else is the default.
I get a crash after step 6. Radeon 7900XTX (24 GB VRAM). 80 GB system RAM.
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 8.46 GiB. GPU 0 has a total capacity of 23.98 GiB of which 7.98 GiB is free. Of the allocated memory 14.00 GiB is allocated by PyTorch, and 1.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I've got the same situation as reported above: memory usage goes up and down in line with the GC collections and CUDA cache cleaning, but on every step it goes a bit higher, until OOM at step 17 (so it's not a VAE issue).
I've got a 3090 24 GB, Windows 10.
My configuration is also the same: diffusion_transformer_768p, 10 sec (temp=31) generation, dtype bf16, cpu_offloading=True (in the generate function), sequential_cpu_offload disabled.
I'm seriously wondering what the difference is from your settings that lets you fit it within 12 GB VRAM.
One guess for such a difference in behaviour was that gc.collect() or torch.cuda.empty_cache() work differently on Linux and Windows, not cleaning up VRAM as effectively on the latter; however, I haven't faced such issues in quite a few years of actively using this optimization technique.
Yet I see that the generation loop populates generated_latents_list with a new item from intermed_latents on every step, so this VRAM occupation growth looks quite natural. I'm more curious how it can stay the same on your side: where do those new generated latents go then?
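To tell whether the growth comes from live tensors (e.g. the accumulated latents) or from the allocator cache, here is a minimal diagnostic sketch; log_vram is a hypothetical helper, not part of the repo, that one would call once per unit inside the generation loop:

```py
import gc
import torch

def log_vram(step: int) -> None:
    """Hypothetical helper: print live vs. cached CUDA memory, then force a cleanup."""
    allocated = torch.cuda.memory_allocated() / 2**30  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # GiB held by the caching allocator
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

    gc.collect()              # drop unreachable Python objects that still hold tensor refs
    torch.cuda.empty_cache()  # release unused cached blocks back to the driver

    print(f"step {step}: reserved after cleanup={torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```

If `allocated` itself climbs each step, live tensors are accumulating; if only `reserved` climbs and empty_cache() does not bring it back down, that points at the caching allocator behaving differently between platforms.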
UPD: more careful observation has shown that the amount of VRAM required for a self.dit run within generate_one_unit gradually increases as the input latent shapes grow.
E.g. the 3rd stage takes a single [2,16,1,96,160] (~7 GB cache) at unit_index 0, but [2,16,7,24,40] + [2,16,1,48,80] + [2,16,1,96,160] + [2,16,1,96,160] (~18 GB cache) at unit_index 9.
This growth looks pretty natural and therefore inevitable; if it's not like that on your side, there should be a reason.
This seems about in line with what I expect. The latent growth is more apparent when using sequential CPU offload, but even then it doesn't seem to get nearly that large. I expect Windows has some implementation of the cache that acts differently. On Linux, I'm able to run a 10s (temp=31) video with only 12 GB of VRAM available. I'm not quite sure where the drastically increasing VRAM is coming from, though, as a [2,16,1,96,160] tensor at fp32 should only take up ~2 MB. Perhaps something is getting deep-copied a bunch on Windows.
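For reference, the arithmetic behind the "~2 MB" figure, plus the other shapes quoted above (a quick sketch; sizes assume dense, contiguous tensors):

```py
import math

def tensor_mib(shape, bytes_per_element):
    """Size of a dense tensor with the given shape, in MiB."""
    return math.prod(shape) * bytes_per_element / 2**20

print(f"{tensor_mib((2, 16, 1, 96, 160), 4):.2f} MiB at fp32")  # ~1.88 MiB
print(f"{tensor_mib((2, 16, 1, 96, 160), 2):.2f} MiB at bf16")  # ~0.94 MiB
print(f"{tensor_mib((2, 16, 7, 24, 40), 2):.2f} MiB at bf16")   # ~0.41 MiB
```

So the latents themselves are tiny; the multi-gigabyte jumps have to come from intermediate activations (the OOM above is raised inside F.scaled_dot_product_attention), which is consistent with memory growing as the concatenated latent sequence gets longer.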
I can't get a 768p video to generate on Linux at bf16. I run out of 24 GB VRAM around step 6. Could this have something to do with the PyTorch version? What Python and PyTorch versions did you use?
I'm using 3.12 and torch 2.5.0. Are you using an AMD GPU?
Yes, Radeon 7900XTX. I wasn't aware of the PyTorch 2.5.0 release yet.
Also, try reducing the VAE tile decode size (and make sure save_memory is enabled). This is currently on line 759 of the DiT pipeline file:
```py
image = self.vae.decode(latents, temporal_chunk=True, window_size=1, tile_sample_min_size=256).sample
```
Set tile_sample_min_size to a multiple of 64 (multiples of 8 also seem to work, but slow down decoding significantly in my experience); lower values use less VRAM during the decoding step.
I don't believe this is the source of the issue, but it's possible
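For concreteness, the tweak being suggested would look roughly like this; it is the same call quoted above with only the tile size changed, and 128 is just an example value:

```py
# Smaller tiles => lower peak VRAM during VAE decoding, at the cost of slower decoding.
image = self.vae.decode(
    latents,
    temporal_chunk=True,
    window_size=1,
    tile_sample_min_size=128,  # was 256; keep it a multiple of 64
).sample
```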
I can't really verify any behavior with AMD GPUs, as I don't have one and I'm not too familiar with how it works with PyTorch. It may be the source of the issue.
I tried with PyTorch 2.5.0 and ROCm 6.2.2, but it doesn't work at all. Tried with Gradio as well, at both resolutions.
```py
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (not support fp16 yet)

model = PyramidDiTForVideoGeneration(
    'pyramid_flow_model',                            # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',      # 'diffusion_transformer_384p'
)

model.vae.enable_tiling()
#model.vae.to("cuda")
#model.dit.to("cuda")
#model.text_encoder.to("cuda")

prompt = "A dog walking on the beach."

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=31,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame, set it to 7 for 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        cpu_offloading=True,
        save_memory=True,           # If you have enough GPU memory, set it to False to improve vae decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```
```
using half precision
Using temporal causal attention
We interp the position embedding of condition latents
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.64s/it]
The latent dimmension channes is 16
The start sigmas and end sigmas of each stage is Start: {0: 1.0, 1: 0.8002399489209289, 2: 0.5007496155411024}, End: {0: 0.6669999957084656, 1: 0.33399999141693115, 2: 0.0}, Ori_start: {0: 1.0, 1: 0.6669999957084656, 2: 0.33399999141693115}
/home/alex/ai/Pyramid-Flow/bugtest.py:22: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
/home/alex/ai/Pyramid-Flow/venv-pyramidflow/lib/python3.10/site-packages/torch/nn/modules/linear.py:125: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at ../aten/src/ATen/Context.cpp:296.)
  return F.linear(input, self.weight, self.bias)
  0%| | 0/31 [00:00<?, ?it/s]
/home/alex/ai/Pyramid-Flow/pyramid_dit/modeling_mmdit_block.py:308: UserWarning: Memory Efficient attention on Navi31 GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:274.)
  stage_hidden_states = F.scaled_dot_product_attention(
  0%| | 0/31 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/alex/ai/Pyramid-Flow/bugtest.py", line 23, in <module>
```
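Two environment variables come up in this reporter's output: the experimental attention flag named in the warning above, and the allocator hint from the earlier HIP OOM message. Both are read when torch initializes HIP, so they must be set before importing torch; a minimal sketch (whether either actually helps on a 7900XTX is untested here):

```py
import os

# Set both before importing torch so they take effect at HIP initialization.
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")         # memory-efficient SDPA on Navi31
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")   # allocator hint from the HIP OOM message

import torch  # noqa: E402

print(torch.cuda.is_available(), torch.version.hip)
```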
@Ednaordinary just to let you know that the issue was solved on my side by a simple upgrade. It turned out I had tried it only on Python 3.9 + PyTorch 2.1.2; the newer setup with Python 3.11 + PyTorch 2.4 behaves exactly like yours, fitting within 10.5 GB VRAM (Windows 10, 3090 24 GB).
And the VRAM starts off low but gradually increases every step until it hits the limit. I have 24 GB of VRAM.