jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

System requirement #5

kexul opened this issue 1 month ago

kexul commented 1 month ago

Thanks for the great work! Would you mind sharing the system requirements to run inference? Can I run it on a free Google Colab T4 GPU with 15 GB of GPU RAM?

kexul commented 1 month ago

diffusion_transformer_768p with bf16 OOMs on my 4090 with 24 GB of GPU RAM.

kexul commented 1 month ago

diffusion_transformer_384p OOMs too...

jy0205 commented 1 month ago

Thanks for your attention! We are optimizing the usage of GPU memory now.

Specifically, we use a tiled VAE decoding strategy, where the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
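For illustration, here is a minimal sketch of lowering the tile size after loading the model, assuming the loading call shown in the repo README and assuming the pipeline exposes its VAE as `model.vae` with a `tile_sample_min_size` attribute; check the linked line for the actual place the value is set:

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration  # loading call assumed from the repo README

# Load the 768p variant in bf16 (adjust the local checkpoint path as needed).
model = PyramidDiTForVideoGeneration(
    "PATH",  # path to the downloaded checkpoint directory
    model_dtype="bf16",
    model_variant="diffusion_transformer_768p",
)

# Assumption: the tiled VAE decoder reads this attribute. Smaller tiles mean smaller
# intermediate activations during decoding, at the cost of more tiles and slower decoding.
model.vae.tile_sample_min_size = 128  # 256 favors speed; 128 is reported below to fit a 24 GB 4090
```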

kexul commented 1 month ago

> Thanks for your attention! We are optimizing the usage of GPU memory now.
>
> Specifically, we use a tiled VAE decoding strategy, where the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623

Thanks for the tip! tile_sample_min_size=128 seems to be a proper value for my 4090.

rrainist commented 1 month ago

Thanks! tile_sample_min_size=256 works well for an A6000 (memory usage went to nearly 40 GB).

sumitmamoria commented 1 month ago

> Thanks for your attention! We are optimizing the usage of GPU memory now. Specifically, we use a tiled VAE decoding strategy, where the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
>
> Thanks for the tip! tile_sample_min_size=128 seems to be a proper value for my 4090.

What kind of speeds are you getting with this setup?

kexul commented 1 month ago

> Thanks for your attention! We are optimizing the usage of GPU memory now. Specifically, we use a tiled VAE decoding strategy, where the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
>
> Thanks for the tip! tile_sample_min_size=128 seems to be a proper value for my 4090.
>
> What kind of speeds are you getting with this setup?

About one minute per clip.

FurkanGozukara commented 1 month ago

@jy0205 Awesome! Can you add all these parameters to the Gradio demo, please?

aniolekx commented 1 month ago

How can I run it on 2x RTX 3090?

Or would a better solution be to buy a desktop with dual processors, 12 memory channels each, DDR5, and a total memory bandwidth of 800 GB/s? A single card with 48 GB of VRAM is too expensive and has similar memory bandwidth... I'm wondering who could answer this question?

eizoxx commented 1 month ago

Is it normal for rendering a 5-second video to take 1.5 hours on an A6000?

kexul commented 1 month ago

> How can I run it on 2x RTX 3090?
>
> Or would a better solution be to buy a desktop with dual processors, 12 memory channels each, DDR5, and a total memory bandwidth of 800 GB/s? A single card with 48 GB of VRAM is too expensive and has similar memory bandwidth... I'm wondering who could answer this question?

You may need to assign the CUDA device manually for different parts of the network if you are familiar with PyTorch. Just check the CPU offload code; it's similar.
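As a rough sketch (not an officially supported path), manual placement on two cards could look like the following; the sub-module names `text_encoder`, `dit`, and `vae` are assumptions about how the pipeline stores its components, so mirror whatever modules the CPU offload code actually moves:

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration  # loading call assumed from the repo README

model = PyramidDiTForVideoGeneration(
    "PATH",
    model_dtype="bf16",
    model_variant="diffusion_transformer_384p",
)

# Spread the large components over the two 3090s instead of offloading to CPU.
model.text_encoder.to("cuda:0")  # prompt encoding
model.dit.to("cuda:0")           # the diffusion transformer (largest module)
model.vae.to("cuda:1")           # VAE decoding on the second card

# Any tensor handed from one stage to the next must be moved to the matching device,
# e.g. latents = latents.to("cuda:1") before the VAE decode.
```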

feifeiobama commented 1 month ago

> Is it normal for rendering a 5-second video to take 1.5 hours on an A6000?

It's not normal; please check the code and GPU usage. Maybe you have turned on CPU offloading. It would be better if you could share where the code gets stuck.

feifeiobama commented 1 month ago

> How can I run it on 2x RTX 3090?
>
> Or would a better solution be to buy a desktop with dual processors, 12 memory channels each, DDR5, and a total memory bandwidth of 800 GB/s? A single card with 48 GB of VRAM is too expensive and has similar memory bandwidth... I'm wondering who could answer this question?

We are working on a solution for multi-GPU inference. Stay tuned.

eizoxx commented 1 month ago

> Is it normal for rendering a 5-second video to take 1.5 hours on an A6000?
>
> It's not normal; please check the code and GPU usage. Maybe you have turned on CPU offloading. It would be better if you could share where the code gets stuck.

I will make a new issue for this topic. Or I will use my old one...

aniolekx commented 1 month ago

> How can I run it on 2x RTX 3090?
>
> Or would a better solution be to buy a desktop with dual processors, 12 memory channels each, DDR5, and a total memory bandwidth of 800 GB/s? A single card with 48 GB of VRAM is too expensive and has similar memory bandwidth... I'm wondering who could answer this question?
>
> We are working on a solution for multi-GPU inference. Stay tuned.

Any prediction of how long this will take? 🤔

feifeiobama commented 1 month ago

> Is it normal for rendering a 5-second video to take 1.5 hours on an A6000?
>
> It's not normal; please check the code and GPU usage. Maybe you have turned on CPU offloading. It would be better if you could share where the code gets stuck.
>
> I will make a new issue for this topic. Or I will use my old one...

Feel free to raise a new issue.

Qiizoff commented 1 month ago

```
[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled
```

Qiizoff commented 1 month ago

It worked for me:

```
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
```

secmanhck commented 1 month ago

If you see this error:

```
[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled
```

You can fix it with:

```
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
```
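After reinstalling, a quick sanity check (plain PyTorch, nothing repo-specific) confirms the new build actually sees the GPU:

```python
import torch

# False here means the installed wheel is still CPU-only and the
# "Torch not compiled with CUDA enabled" error will come back.
print(torch.cuda.is_available())
print(torch.version.cuda)  # CUDA version the wheel was built against, or None for CPU builds
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```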

ahaubenstock commented 4 weeks ago

My memory usage is around 24 GB during generation and then spikes to 65 GB during decoding. I tried lowering tile_sample_min_size to 32, but it didn't reduce memory usage at all. Does anyone have an idea what I'm missing?
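One way to pin down where the spike happens is to read PyTorch's peak-memory counters around each stage; this is plain torch instrumentation, and the commented-out calls are only placeholders for whatever generation and decode entry points you are using:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# frames = model.generate(prompt=..., ...)  # placeholder for the sampling / DiT stage
print(f"peak during generation: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")

torch.cuda.reset_peak_memory_stats()
# video = decode_latents(frames)            # placeholder for the VAE decode stage
print(f"peak during decode: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")
```

If that 65 GB reading is system RAM (for example with CPU offloading enabled), the CUDA counters above will stay low and the spike is happening outside the GPU, where tile_sample_min_size may not be the knob that matters.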

Polimagoo commented 2 days ago

> Thanks for your attention! We are optimizing the usage of GPU memory now. Specifically, we use a tiled VAE decoding strategy, where the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
>
> Thanks for the tip! tile_sample_min_size=128 seems to be a proper value for my 4090.

tile_sample_min_size=128 worked like a charm on my 2x 2080 Ti; still waiting for a way to get those two working together...