The intent of multi-GPU inference is both to reduce per-GPU memory usage and to speed up generation. However, (1) we haven't tested it with CPU offloading, and (2) it is based on Sequence Parallelism, so the full model must be loaded on each GPU. That puts a lower bound on per-GPU memory rather than directly halving the GPU usage.
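To make that lower bound concrete, here is a rough back-of-the-envelope calculation. The GB figures below are made up for illustration, not measurements of this model; the point is only that under sequence parallelism every GPU keeps a full copy of the weights and only the activation memory shrinks as GPUs are added:

```python
# Illustrative only: the numbers are assumptions, not measured values.
# Under sequence parallelism, every rank loads the full weights; only the
# activation memory is split across GPUs.
weights_gb = 8.0       # full model weights, replicated on every GPU
activations_gb = 6.0   # activations for the whole sequence on a single GPU

for num_gpus in (1, 2):
    per_gpu_gb = weights_gb + activations_gb / num_gpus
    print(f"{num_gpus} GPU(s): ~{per_gpu_gb:.1f} GB per GPU")

# 1 GPU(s): ~14.0 GB per GPU
# 2 GPU(s): ~11.0 GB per GPU  -- the replicated weights set a floor well above half of 14 GB
```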
For now, we suggest trying out CPU offloading with the single-GPU inference script, which should be able to run within 12GB of memory.
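As a starting point, something along these lines in the single-GPU script should enable offloading. This is only a sketch: the import, class name, checkpoint path, variant string, and the assumption that the single-GPU `generate()` accepts the same `cpu_offloading` flag as `inference_multigpu.py` all need to be checked against the repo:

```python
import torch
from pyramid_dit import PyramidDiTForVideoGeneration  # assumed import, mirroring the single-GPU example

# Placeholder checkpoint path and variant -- substitute the values from the README.
model = PyramidDiTForVideoGeneration(
    "PATH/TO/pyramid-flow-checkpoint",
    model_dtype="bf16",
    model_variant="diffusion_transformer_384p",
)

# With offloading enabled, submodules are moved to the GPU only while they are
# needed, so do NOT call model.to("cuda") beforehand.
with torch.no_grad(), torch.autocast("cuda", dtype=torch.bfloat16):
    frames = model.generate(
        prompt="a cat walking on grass",   # placeholder prompt
        cpu_offloading=True,               # the flag discussed in this thread
    )
```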
I have an RTX 3060 and an RTX 4070 in my system, both with 12GB. The X server runs on the RTX 4070, leaving only about 11GB of VRAM there, so with X running I can run the single-GPU script from the project page successfully only on the RTX 3060. If I switch to runlevel 3 (no X server), I can run that script on either GPU.

I updated my git repo to the current code as of today, Oct 15, and tried text-to-video with the scripts/inference_multigpu.sh script. I changed the inference_multigpu.py script to set cpu_offloading=True in both places, and that did not help. I tried adding model.enable_sequential_cpu_offload(), and that did not help. I also tried adding model.enable_sequential_cpu_offload() just before the model.vae.to(device) statement, and that did not help. I get the out-of-memory error for both the 384P and 768P models.

Is the intent of the multi-GPU support to cut memory usage on each GPU by roughly half by splitting a single frame's generation across both GPUs, or is it to shorten run time by generating separate frames on separate GPUs?