kexul opened this issue 1 month ago
diffusion_transformer_768p with bf16 OOMs on my 4090 with 24 GB of GPU RAM. diffusion_transformer_384p OOMs too.
Thanks for your attention! We are optimizing GPU memory usage now. Specifically, we use a tiled VAE decoding strategy: the current setting of tile_sample_min_size prioritizes speed, but it can be changed to significantly reduce GPU memory usage: https://github.com/jy0205/Pyramid-Flow/blob/e3389908eb34f08253f5f1e2c72f383ec052629d/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L623
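A minimal sketch of how this might look, assuming the model is loaded as in the repo README and that the tile size used by the tiled decode at the linked line is exposed on the VAE module (the "PATH" placeholder and the attribute path are assumptions, not verified API):

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration  # import path as in the repo README

torch.cuda.set_device(0)

# Load the bf16 768p variant; "PATH" is a placeholder for the downloaded checkpoints.
model = PyramidDiTForVideoGeneration(
    "PATH",
    "bf16",
    model_variant="diffusion_transformer_768p",
)

# Assumption: the tiled VAE decode reads tile_sample_min_size at the linked line of
# pyramid_dit_for_video_gen_pipeline.py; if the VAE exposes it as an attribute,
# lowering it trades decode speed for a much smaller memory peak.
# 128 reportedly fits a 24 GB RTX 4090; otherwise, edit the value at the linked line.
if hasattr(model.vae, "tile_sample_min_size"):
    model.vae.tile_sample_min_size = 128
```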
Thanks for the tip! tile_sample_min_size=128 seems to be the right value for my 4090.
Thanks! tile_sample_min_size=256 works well on an A6000 (memory usage went up to nearly 40 GB).
> tile_sample_min_size=128 seems to be the right value for my 4090.
What kind of speeds are you getting with this setup?
> What kind of speeds are you getting with this setup?
About one minute per clip.
@jy0205 Awesome! Can you add all these parameters to the Gradio demo, please?
How do I run it on 2x RTX 3090? Or would a better solution be to buy a desktop with two processors, 12 memory channels each, DDR5, and a total memory bandwidth of 800 GB/s? A single card with 48 GB of VRAM is too expensive and has similar memory bandwidth. I'm wondering who could answer this question.
Is it normal for a 5-second video to take 1.5 hours to render on an A6000?
> How do I run it on 2x RTX 3090?
You may need to assign the CUDA device manually for different parts of the network, if you are familiar with PyTorch. Just check the CPU offload code; it's similar.
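As a rough illustration of what that manual placement might look like (the submodule names dit, text_encoder and vae are assumptions about the wrapper object; mirror whatever the CPU-offload code actually moves):

```python
from pyramid_dit import PyramidDiTForVideoGeneration  # import path as in the repo README

# "PATH" is a placeholder for the local checkpoint directory.
model = PyramidDiTForVideoGeneration(
    "PATH",
    "bf16",
    model_variant="diffusion_transformer_768p",
)

# Assumption: the wrapper exposes its main parts as attributes, mirroring what the
# CPU-offload path moves between devices. Put the transformer on one 3090 and the
# text encoder + VAE on the other; intermediate tensors then have to be moved to the
# matching device inside the generation/decoding calls, just as the offload code does.
model.dit.to("cuda:0")
model.text_encoder.to("cuda:1")
model.vae.to("cuda:1")
```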
> Is it normal for a 5-second video to take 1.5 hours to render on an A6000?
It's not normal; please check the code and GPU usage. Maybe you have turned on CPU offloading. It would help if you could share where the code gets stuck.
> How do I run it on 2x RTX 3090?
We are working on a solution for multi-GPU inference. Stay tuned.
> It's not normal; please check the code and GPU usage.
I will make a new issue for this topic. Or I will use my old one...
> We are working on a solution for multi-GPU inference. Stay tuned.
Any prediction of how long this will take? 🤔
> I will make a new issue for this topic. Or I will use my old one...
Feel free to raise a new issue.
[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled
It worked for me:
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.4 -c pytorch -c nvidia
If you see this error:
[INFO] Model initialized successfully.
[INFO] Starting text-to-video generation...
[ERROR] Error during text-to-video generation: Torch not compiled with CUDA enabled
you can fix it with:
conda install pytorch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 pytorch-cuda=11.8 -c pytorch -c nvidia
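After reinstalling, a quick sanity check with plain PyTorch calls confirms the new build actually sees the GPU before rerunning generation:

```python
import torch

# Both should indicate a CUDA-enabled build: a version string (not None) and True.
print(torch.version.cuda)
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```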
My memory usage is around 24 GB during generation and then spikes to 65 GB during decode. I tried lowering tile_sample_min_size to 32, but it didn't reduce memory usage at all. Does anyone have an idea what I'm missing?
tile_sample_min_size=128 worked like a charm on my 2x 2080 Ti; still waiting for a way to get the two cards working together...
Thanks for the great work! Would you mind sharing the system requirements to run inference? Can I run it on a free Google Colab T4 GPU with 15 GB of GPU RAM?