Closed wladradchenko closed 2 weeks ago
The underlying mechanism is indeed similar: CPU offloading manually moves unused modules to the CPU and keeps only the currently active module on the GPU.
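A minimal sketch of that offloading pattern (plain Python standing in for real GPU weight transfers; the module names and the `+1` computation are purely illustrative):

```python
# Illustrative sketch of sequential CPU offloading: only the module
# currently executing lives on the "GPU"; everything else sits on "CPU".
class Module:
    def __init__(self, name):
        self.name = name
        self.device = "cpu"          # all weights start offloaded

    def to(self, device):
        self.device = device         # stands in for a real weight transfer
        return self

    def forward(self, x):
        assert self.device == "cuda", "module must be on GPU to run"
        return x + 1                 # placeholder computation

def offloaded_forward(modules, x):
    for m in modules:
        m.to("cuda")                 # move only this module onto the GPU
        x = m.forward(x)
        m.to("cpu")                  # offload it again before the next step
    return x

pipeline = [Module("text_encoder"), Module("transformer"), Module("vae")]
out = offloaded_forward(pipeline, 0)
print(out)                           # 3
print([m.device for m in pipeline])  # ['cpu', 'cpu', 'cpu']
```

Peak GPU residency is thus a single module's weights rather than the whole pipeline, at the cost of repeated host-device transfers.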
@jy0205 @feifeiobama Thank you. Got it. I have another question about the approach. As I understand it, it is possible to generate video from text or an image, and the approach is based on Stable Diffusion 3. But is it possible to generate video conditioned on both the first and last frames? If the current repository does not support this, does the approach itself have that capability, or is it an insurmountable limitation?
I think this limitation comes from our MAGVIT2-like VAE. Since it explicitly encodes only the first frame, it is much easier to condition on the first frame than on both the first and last frames. Of course, this can be overcome with architectural changes, such as adding a conditioning branch for the last frame.
Hi! I have a question about the memory optimization introduced in PR #75. Specifically, I’m curious about how the approach reduces the required video memory during generation.
I’ve encountered a method where different layers of a large model are assigned to different devices within a single machine (not a cluster) to manage memory:
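A hedged sketch of such a layer-to-device assignment (the module names and the prefix-matching helper are hypothetical, modeled on how libraries like Hugging Face Accelerate dispatch submodules via a `device_map` dict):

```python
# Hypothetical device map: each named submodule is pinned to a device.
# Module names are illustrative, not from any particular model.
device_map = {
    "text_encoder": "cuda:0",
    "transformer.blocks.0": "cuda:0",
    "transformer.blocks.1": "cuda:1",
    "vae": "cpu",  # swapping "cuda:1" for "cpu" spills that part to RAM
}

def device_for(module_name, device_map):
    """Resolve a submodule to its device by longest-prefix match."""
    best = max(
        (k for k in device_map
         if module_name == k or module_name.startswith(k + ".")),
        key=len,
        default=None,
    )
    return device_map.get(best)

print(device_for("transformer.blocks.1.attn", device_map))  # cuda:1
print(device_for("vae.decoder", device_map))                # cpu
```

With a map like this, the placement is static: each layer stays on its assigned device for the whole run.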
This kind of device mapping allows the model to be split across GPUs and the CPU (for example, by changing cuda:1 to cpu). Is the optimization in this PR based on a similar concept, or is a different mechanism at work here?
Thanks in advance for any insights!