jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

System requirement #5

Open kexul opened 1 week ago

kexul commented 1 week ago

Thanks for the great work! Would you mind sharing the system requirements for running inference? Can I run it on a free Google Colab T4 GPU with 15 GB of GPU RAM?

kexul commented 1 week ago

diffusion_transformer_768p with bf16 OOMs on my 4090 with 24 GB of GPU RAM.

kexul commented 1 week ago

diffusion_transformer_384p OOMs as well.

jy0205 commented 1 week ago

Thanks for your interest! We are optimizing GPU memory usage right now.

Specifically, we use a tiled VAE decoding strategy. The current setting of tile_sample_min_size prioritizes speed, but lowering it significantly reduces GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
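
For illustration, here is a minimal sketch of the idea behind tiled VAE decoding, assuming a diffusers-style `vae.decode(...)` that returns an object with a `.sample` tensor. It is not the repo's actual implementation (which also blends overlapping tiles and handles the temporal dimension), but it shows why a smaller `tile_sample_min_size` lowers peak memory:

```python
import torch

def tiled_vae_decode(vae, latents, tile_sample_min_size=128, spatial_scale=8):
    """Conceptual sketch of spatial tiled VAE decoding (not the repo's code).

    The latent frame is decoded tile by tile, so peak GPU memory scales with
    tile_sample_min_size rather than with the full output resolution.
    Smaller tiles -> lower memory, slower decoding. Overlap blending and the
    temporal dimension are omitted for brevity.
    """
    tile_latent = tile_sample_min_size // spatial_scale   # tile size in latent space
    b, _, h, w = latents.shape
    output = None
    for i in range(0, h, tile_latent):
        for j in range(0, w, tile_latent):
            tile = latents[:, :, i:i + tile_latent, j:j + tile_latent]
            decoded = vae.decode(tile).sample              # decode one tile at a time
            if output is None:
                output = decoded.new_zeros(
                    b, decoded.shape[1], h * spatial_scale, w * spatial_scale)
            output[:, :,
                   i * spatial_scale:i * spatial_scale + decoded.shape[2],
                   j * spatial_scale:j * spatial_scale + decoded.shape[3]] = decoded
    return output
```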

kexul commented 1 week ago

> Thanks for your interest! We are optimizing GPU memory usage right now.
>
> Specifically, we use a tiled VAE decoding strategy. The current setting of tile_sample_min_size prioritizes speed, but lowering it significantly reduces GPU memory usage:
>
> https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623

Thanks for the tip! tile_sample_min_size=128 seems to be a good value for my 4090.

rrainist commented 1 week ago

Thanks! tile_sample_min_size=256 works well on an A6000 (memory usage went up to nearly 40 GB).

sumitmamoria commented 1 week ago

> Thanks for the tip! tile_sample_min_size=128 seems to be a good value for my 4090.

What kind of speeds are you getting with this setup?

kexul commented 1 week ago

> What kind of speeds are you getting with this setup?

About one minute per clip.

FurkanGozukara commented 1 week ago

@jy0205 Awesome! Could you please add all of these parameters to the Gradio demo?

aniolekx commented 1 week ago

How can I run it on 2x RTX 3090?

Or would a better solution be to buy a desktop with two CPUs, 12 DDR5 memory channels per CPU, and about 800 GB/s of total memory bandwidth? A single 48 GB VRAM card is too expensive and has similar memory bandwidth. I'm wondering who could answer this question.

eizoxx commented 1 week ago

Is it normal for a 5-second video to take 1.5 hours to render on an A6000?

kexul commented 1 week ago

> How can I run it on 2x RTX 3090?
>
> Or would a better solution be to buy a desktop with two CPUs, 12 DDR5 memory channels per CPU, and about 800 GB/s of total memory bandwidth?

If you are familiar with PyTorch, you can manually assign different parts of the network to different CUDA devices. Just check the CPU offload code; the approach is similar.
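
For illustration, a minimal sketch of that idea, assuming the loaded pipeline object exposes text_encoder, dit, and vae sub-modules; the actual attribute names may differ, so check the CPU offload code for the real ones:

```python
import torch

def split_across_two_gpus(model):
    """Sketch: place pipeline sub-modules on different GPUs instead of
    offloading them to CPU. The attribute names (text_encoder, dit, vae)
    are assumptions; check the actual pipeline/offload code for the real ones.
    """
    device_0 = torch.device("cuda:0")
    device_1 = torch.device("cuda:1")

    model.text_encoder.to(device_0)   # prompt encoding on GPU 0
    model.dit.to(device_0)            # diffusion transformer on GPU 0
    model.vae.to(device_1)            # VAE decoding on GPU 1

    # Tensors must be moved when they cross the GPU boundary, e.g.
    # latents = latents.to(device_1) before the VAE decode step.
    return model
```

Splitting by stage keeps each sub-module whole on one card, which is simpler than sharding a single module across GPUs.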

feifeiobama commented 1 week ago

> Is it normal for a 5-second video to take 1.5 hours to render on an A6000?

No, that's not normal. Please check the code and your GPU usage; you may have turned on CPU offloading. It would also help if you could share where the code gets stuck.
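
As a quick sanity check (generic PyTorch, not specific to this repo), you can confirm the model is actually resident on the GPU while a clip renders:

```python
import torch

# If CPU offloading is active (or CUDA is not being used at all), allocated
# GPU memory will stay near zero while generation is running.
print("CUDA available:", torch.cuda.is_available())
print(f"Allocated: {torch.cuda.memory_allocated() / 2**30:.1f} GiB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 2**30:.1f} GiB")
# Alternatively, watch `nvidia-smi` in a second terminal during generation.
```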

feifeiobama commented 1 week ago

> How can I run it on 2x RTX 3090?

We are working on a solution for multi-GPU inference. Stay tuned.

eizoxx commented 1 week ago

> No, that's not normal. Please check the code and your GPU usage; you may have turned on CPU offloading. It would also help if you could share where the code gets stuck.

I will open a new issue for this topic, or reuse my old one...

aniolekx commented 1 week ago

> We are working on a solution for multi-GPU inference. Stay tuned.

Any predictions on how long this will take? 🤔

feifeiobama commented 1 week ago

> I will open a new issue for this topic, or reuse my old one...

Feel free to open a new issue.

Qiizoff commented 1 week ago

[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled

Qiizoff commented 1 week ago

This worked for me: `conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia`

secmanhck commented 1 week ago

If you see this error:

[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled

you can fix it with:

`conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia`
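
Either way, after reinstalling it is worth verifying that the new build is actually CUDA-enabled before rerunning generation, for example:

```python
import torch

# "Torch not compiled with CUDA enabled" means a CPU-only PyTorch build is
# installed. A CUDA-enabled build reports True here and a non-None CUDA version.
print(torch.__version__)
print(torch.cuda.is_available())
print(torch.version.cuda)
```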