dillfrescott opened this issue 1 month ago
Even using app.py, it hits step 14 and starts flooding into shared RAM, halting progress.
I'm unable to reproduce using your shared script and the latest commit. Make sure your repo is up to date, as the fix for this was merged yesterday. Disabling system memory fallback (I believe this is something you can do in the NVIDIA Control Panel, though I don't use Windows) will also fix this, as the GPU will recognize it's time to deallocate when it reaches the max.
Okay thank you!
I think that was my issue. I disabled the sysmem fallback and it seems to be helping.
Never mind. Now it says:
```
Traceback (most recent call last):
  File "text.py", line 23, in <module>
    frames = model.generate(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 703, in generate
    intermed_latents = self.generate_one_unit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\pyramid_dit_for_video_gen_pipeline.py", line 285, in generate_one_unit
    noise_pred = self.dit(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_pyramid_mmdit.py", line 479, in forward
    encoder_hidden_states, hidden_states = block(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 640, in forward
    attn_output, context_attn_output = self.attn(
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "C:\Users\cross\miniconda3\envs\pyramid\lib\site-packages\torch\nn\modules\module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 548, in forward
    hidden_states, encoder_hidden_states = self.var_len_attn(
  File "C:\Users\cross\Downloads\Pyramid-Flow\pyramid_dit\modeling_mmdit_block.py", line 308, in __call__
    stage_hidden_states = F.scaled_dot_product_attention(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 13.68 GiB. GPU 0 has a total capacty of 23.99 GiB of which 12.48 GiB is free. Of the allocated memory 6.10 GiB is allocated by PyTorch, and 3.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
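The allocator hint at the end of that message can be tried without touching the model code. A minimal sketch, assuming the generation script is launched directly; the 512 MB value is only an illustration, not a recommendation from the repo:

```py
import os

# The caching-allocator config is read when CUDA is initialized, so set it
# before the first import of torch (or export it in the shell before launching).
# max_split_size_mb limits how large a cached block the allocator will split,
# which can help when a big allocation fails despite plenty of "free" memory.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:512")

import torch  # noqa: E402  (imported after setting the env var on purpose)

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"], torch.cuda.is_available())
```

This only mitigates fragmentation; it won't help if a single attention call genuinely needs 13+ GiB.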
I don't understand why it is still doing this.
Hmm. Do you have the temp, res, or something else set super high? It shouldn't be allocating 14 GB with the script you provided in the original post.
I tried app.py as well and it got to step 17 and crashed (OOM). I have not modified anything.
I set temp to 31 because I want a 10 second video, but everything else is the default.
I get a crash after step 6. Radeon 7900XTX (24 GB VRAM). 80 GB system RAM.
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 8.46 GiB. GPU 0 has a total capacity of 23.98 GiB of which 7.98 GiB is free. Of the allocated memory 14.00 GiB is allocated by PyTorch, and 1.64 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
I've got the same situation as reported above: memory usage goes up and down in line with the GC collections and CUDA cache cleaning, but on every step it goes a bit higher, until OOM at step 17 (so it's not a VAE issue).
I've got a 3090 24 GB, Windows 10.
My configuration is also the same: diffusion_transformer_768p, 10 sec (temp=31) generation, dtype bf16, cpu_offloading=True (in the generate function), sequential_cpu_offload disabled.
I'm seriously wondering what the difference is from your settings that lets you fit it within 12 GB VRAM.
One guess for such a difference in behaviour was that gc.collect() or torch.cuda.empty_cache() work differently on Linux and Windows, not cleaning up VRAM as effectively on the latter; however, I haven't faced such issues in quite a few years of actively using this optimization technique.
Yet I see that the generation loop populates generated_latents_list with a new item from intermed_latents on every step, so this VRAM occupation growth looks quite natural. I'm more curious how it can stay the same on your side: where do those new generated latents go then?
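To tell whether the growth comes from live tensors (e.g. the accumulated latents) or from the allocator cache, here is a minimal diagnostic sketch; log_vram is a hypothetical helper, not part of the repo, that one would call once per unit inside the generation loop:

```py
import gc
import torch

def log_vram(step: int) -> None:
    """Hypothetical helper: print live vs. cached CUDA memory, then force a cleanup."""
    allocated = torch.cuda.memory_allocated() / 2**30  # GiB held by live tensors
    reserved = torch.cuda.memory_reserved() / 2**30    # GiB held by the caching allocator
    print(f"step {step}: allocated={allocated:.2f} GiB, reserved={reserved:.2f} GiB")

    gc.collect()              # drop unreachable Python objects that still hold tensor refs
    torch.cuda.empty_cache()  # release unused cached blocks back to the driver

    print(f"step {step}: reserved after cleanup={torch.cuda.memory_reserved() / 2**30:.2f} GiB")
```

If `allocated` itself climbs each step, live tensors are accumulating; if only `reserved` climbs and empty_cache() does not bring it back down, that points at the caching allocator behaving differently between platforms.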
UPD: more careful observation has shown that the amount of VRAM required for a self.dit run within generate_one_unit gradually increases as the input latent shapes grow.
E.g. the 3rd stage takes a single [2,16,1,96,160] (~7 GB cache) at unit_index 0, but [2,16,7,24,40] + [2,16,1,48,80] + [2,16,1,96,160] + [2,16,1,96,160] (~18 GB cache) at unit_index 9.
This growth looks pretty natural and therefore inevitable; if it's not like that on your side, there should be a reason.
This seems about in line with what I expect. The latent growth is more apparent when using sequential CPU offload, but even then it doesn't seem to get nearly that large. I expect Windows has some implementation of the cache that acts differently. On Linux, I'm able to run a 10s (temp=31) video with only 12 GB of VRAM available. I'm not quite sure where the drastically increasing VRAM is coming from, though, as a [2,16,1,96,160] tensor at fp32 should only take up ~2 MB. Perhaps something is getting deep-copied a bunch on Windows.
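For reference, the arithmetic behind the "~2 MB" figure, plus the other shapes quoted above (a quick sketch; sizes assume dense, contiguous tensors):

```py
import math

def tensor_mib(shape, bytes_per_element):
    """Size of a dense tensor with the given shape, in MiB."""
    return math.prod(shape) * bytes_per_element / 2**20

print(f"{tensor_mib((2, 16, 1, 96, 160), 4):.2f} MiB at fp32")  # ~1.88 MiB
print(f"{tensor_mib((2, 16, 1, 96, 160), 2):.2f} MiB at bf16")  # ~0.94 MiB
print(f"{tensor_mib((2, 16, 7, 24, 40), 2):.2f} MiB at bf16")   # ~0.41 MiB
```

So the latents themselves are tiny; the multi-gigabyte jumps have to come from intermediate activations (the OOM above is raised inside F.scaled_dot_product_attention), which is consistent with memory growing as the concatenated latent sequence gets longer.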
I can't get a 768p video to generate on Linux at bf16. I run out of 24 GB VRAM around step 6. Could this have something to do with the PyTorch version? What Python and PyTorch versions did you use?
I'm using 3.12 and torch 2.5.0. Are you using an AMD GPU?
Yes, Radeon 7900XTX. I wasn't aware of the PyTorch 2.5.0 release yet.
Also, try reducing the VAE tile decode size (and make sure save_memory is enabled). This is currently on line 759 of the DiT pipeline file:
```py
image = self.vae.decode(latents, temporal_chunk=True, window_size=1, tile_sample_min_size=256).sample
```
Set tile_sample_min_size to a multiple of 64 (multiples of 8 also seem to work, but slow down decoding significantly in my experience); lower values use less VRAM during the decoding step.
I don't believe this is the source of the issue, but it's possible
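For concreteness, the tweak being suggested would look roughly like this; it is the same call quoted above with only the tile size changed, and 128 is just an example value:

```py
# Smaller tiles => lower peak VRAM during VAE decoding, at the cost of slower decoding.
image = self.vae.decode(
    latents,
    temporal_chunk=True,
    window_size=1,
    tile_sample_min_size=128,  # was 256; keep it a multiple of 64
).sample
```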
I can't really verify any behavior with AMD GPUs, as I don't have one and I'm not too familiar with how it works with PyTorch. It may be the source of the issue.
I tried with PyTorch 2.5.0 and ROCm 6.2.2, but it doesn't work at all. Tried with Gradio as well, at both resolutions.
```py
import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (not support fp16 yet)

model = PyramidDiTForVideoGeneration(
    'pyramid_flow_model',                            # The downloaded checkpoint dir
    model_dtype,
    model_variant='diffusion_transformer_768p',      # 'diffusion_transformer_384p'
)

model.vae.enable_tiling()
#model.vae.to("cuda")
#model.dit.to("cuda")
#model.text_encoder.to("cuda")

prompt = "A dog walking on the beach."

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=768,
        width=1280,
        temp=31,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=9.0,         # The guidance for the first frame, set it to 7 for 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        cpu_offloading=True,
        save_memory=True,           # If you have enough GPU memory, set it to False to improve vae decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
```
```
using half precision
Using temporal causal attention
We interp the position embedding of condition latents
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.64s/it]
The latent dimmension channes is 16
The start sigmas and end sigmas of each stage is Start: {0: 1.0, 1: 0.8002399489209289, 2: 0.5007496155411024}, End: {0: 0.6669999957084656, 1: 0.33399999141693115, 2: 0.0}, Ori_start: {0: 1.0, 1: 0.6669999957084656, 2: 0.33399999141693115}
/home/alex/ai/Pyramid-Flow/bugtest.py:22: FutureWarning: `torch.cuda.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cuda', args...)` instead.
  with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
/home/alex/ai/Pyramid-Flow/venv-pyramidflow/lib/python3.10/site-packages/torch/nn/modules/linear.py:125: UserWarning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (Triggered internally at ../aten/src/ATen/Context.cpp:296.)
  return F.linear(input, self.weight, self.bias)
  0%| | 0/31 [00:00<?, ?it/s]
/home/alex/ai/Pyramid-Flow/pyramid_dit/modeling_mmdit_block.py:308: UserWarning: Memory Efficient attention on Navi31 GPU is still experimental. Enable it with TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1. (Triggered internally at ../aten/src/ATen/native/transformers/hip/sdp_utils.cpp:274.)
  stage_hidden_states = F.scaled_dot_product_attention(
  0%| | 0/31 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/home/alex/ai/Pyramid-Flow/bugtest.py", line 23, in <module>
```
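Two environment variables come up in this reporter's output: the experimental attention flag named in the warning above, and the allocator hint from the earlier HIP OOM message. Both are read when torch initializes HIP, so they must be set before importing torch; a minimal sketch (whether either actually helps on a 7900XTX is untested here):

```py
import os

# Set both before importing torch so they take effect at HIP initialization.
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")         # memory-efficient SDPA on Navi31
os.environ.setdefault("PYTORCH_HIP_ALLOC_CONF", "expandable_segments:True")   # allocator hint from the HIP OOM message

import torch  # noqa: E402

print(torch.cuda.is_available(), torch.version.hip)
```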
@Ednaordinary just to let you know that the issue was solved on my side by a simple upgrade. It turned out I had tried it only on Python 3.9 + PyTorch 2.1.2; the newer setup with Python 3.11 + PyTorch 2.4 behaves exactly like yours, fitting within 10.5 GB VRAM (Windows 10, 3090 24 GB).
And the VRAM starts off low but gradually increases every step until it hits the limit. I have 24 GB of VRAM.