jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

Question on Memory Optimization in PR #75 #150

Closed wladradchenko closed 2 weeks ago

wladradchenko commented 2 weeks ago

Hi! I have a question about the memory optimization introduced in PR #75. Specifically, I’m curious about how the approach reduces the required video memory during generation.

I’ve encountered a method where different layers of a large model are assigned to different devices on a single machine (not a cluster) to manage memory:

device_map = {
    'encoder.layer.0': 'cuda:0',
    'encoder.layer.1': 'cuda:1',
    'decoder.layer.0': 'cuda:0',
    'decoder.layer.1': 'cuda:1',
}

This kind of device mapping allows loading across both GPUs and the CPU (for example, changing cuda:1 to cpu). Is the optimization in this PR based on a similar concept, or is there a different mechanism at work here?
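For illustration, a map like the one above can also be built programmatically by spreading layers round-robin over the available devices. This is a hypothetical helper, not part of Pyramid-Flow or any particular library:

```python
def build_device_map(layer_names, devices):
    """Assign each named layer to a device in round-robin order.

    layer_names: module names, e.g. 'encoder.layer.0'
    devices: target device strings, e.g. ['cuda:0', 'cpu']
    """
    return {name: devices[i % len(devices)]
            for i, name in enumerate(layer_names)}

# Spread the encoder/decoder layers over one GPU and the CPU,
# mirroring the "change cuda:1 to cpu" variant mentioned above.
layers = ['encoder.layer.0', 'encoder.layer.1',
          'decoder.layer.0', 'decoder.layer.1']
device_map = build_device_map(layers, ['cuda:0', 'cpu'])
# device_map == {'encoder.layer.0': 'cuda:0', 'encoder.layer.1': 'cpu',
#                'decoder.layer.0': 'cuda:0', 'decoder.layer.1': 'cpu'}
```

The resulting dict has the same shape as the `device_map` argument accepted by libraries that support sharded loading.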

Thanks in advance for any insights!

feifeiobama commented 2 weeks ago

The underlying mechanism is indeed similar: CPU offloading manually moves unused modules to the CPU and keeps only the currently used module on the GPU.
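As a rough sketch of that pattern (device strings only, so no real GPU is required; the module names and helper are made up for illustration, not the repository's actual code):

```python
class Module:
    """Toy stand-in for an nn.Module that just tracks its device."""
    def __init__(self, name):
        self.name = name
        self.device = 'cpu'   # everything starts offloaded on the CPU

    def to(self, device):
        self.device = device
        return self

def run_with_offload(modules, inputs, gpu='cuda:0'):
    """Run modules sequentially, keeping only the active one on the GPU.

    Each module is moved to the GPU right before it runs and back to
    the CPU right after, so peak GPU memory holds one module at a time.
    """
    x = inputs
    for m in modules:
        m.to(gpu)             # bring the module we need onto the GPU
        x = f'{m.name}({x})'  # placeholder for the real forward pass
        m.to('cpu')           # offload it again to free GPU memory
    return x

pipeline = [Module('text_encoder'), Module('dit'), Module('vae_decoder')]
out = run_with_offload(pipeline, 'prompt')
# out == 'vae_decoder(dit(text_encoder(prompt)))'
# and every module is back on the CPU afterwards
```

The trade-off is extra host-device transfer time on every step in exchange for a much smaller peak GPU footprint.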

wladradchenko commented 2 weeks ago

@jy0205 @feifeiobama Thank you, got it. I have another question about the approach. As I understand it, it is possible to generate video from text or from an image, and the approach is based on Stable Diffusion 3. But is it possible to generate video conditioned on both the first and last frames? If the current repository does not support this, does the approach itself have such a capability, or is this an insurmountable limitation?

feifeiobama commented 2 weeks ago

I think this limitation comes from our MAGVIT2-like VAE. Since it explicitly encodes only the first frame, it is much easier to do conditioning on the first frame than on both the first and last frames. Of course, this can be overcome with architectural changes, such as adding a conditioning branch for the last frame.