jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

Questions about implementation #148

Closed: yjhong89 closed this issue 19 hours ago

yjhong89 commented 3 days ago

Hi! Thanks for sharing training code!

While analyzing the implementation in detail, I have a few questions.

  1. https://github.com/jy0205/Pyramid-Flow/blob/e4b02ef31edba13e509896388b1fedd502ea767c/pyramid_dit/pyramid_dit_for_video_gen_pipeline.py#L451

    • Why use the indexing [index::column_size]? Since latents_list[i_s] would have shape [bs, c, t, h, w], latents_list[i_s][index::column_size] just selects part of the batch, doesn't it?
  2. How does the video sync group work?

    • If I use 8 GPUs with the default parallel-group hyper-parameter settings, both sp_group_size and video_sync_group would be 8.
    • Since sequence parallel already splits the long token sequence, every GPU sees the same video input, so why is video_sync_group necessary?
  3. When extracting video latents in advance, do all videos need to have the same fps? This line seems to mean that if "frame" is not specified in the annotation, the first 121 frames are extracted.

  4. Why multiply by 2 here? Is it to preserve the variance at each stage?

Thanks!

jy0205 commented 3 days ago

Here are the answers to your questions:

  1. latents_list[i_s][index::column_size] selects the batch of samples that belong to the same stage (see the first sketch after this list).
  2. We do not use sequence parallel during training; the sequence-parallel code is for multi-GPU inference. The video_sync_group parameter controls the group of processes that receive the same input sample (see the second sketch after this list).
  3. We directly use 24 fps for training. The frames key lets you specify the frame indexes you want to extract.
  4. Multiplying by 2 makes the variance of the noise still equal to 1 after bilinear interpolation, so it remains a standard Gaussian (see the last sketch after this list).
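
To illustrate answer 1, here is a minimal sketch (a toy layout of my own, not the repository's exact code) showing that the strided slice [index::column_size] pulls out all batch rows belonging to one stage, rather than a single sample:

```python
import torch

# Toy packing: consecutive rows of the batch cycle through `column_size` stage columns,
# so row r belongs to column r % column_size.
column_size = 4                     # assumed number of packed stage columns
rows_per_column = 2
bs = column_size * rows_per_column

stage_tag = torch.arange(bs) % column_size                     # [0, 1, 2, 3, 0, 1, 2, 3]
latents = stage_tag.view(bs, 1, 1, 1, 1).expand(bs, 4, 8, 16, 16).float()

index = 1
same_stage = latents[index::column_size]                       # strided slice over the batch dim
print(same_stage.shape)            # torch.Size([2, 4, 8, 16, 16]) -> 2 samples, same stage
print(same_stage[:, 0, 0, 0, 0])   # tensor([1., 1.]) -> both rows came from column 1
```

For answer 2, a rough sketch of the idea behind video_sync_group (assumed torch.distributed usage, not the repository's actual implementation): ranks inside one group should draw identical batches, so a sampler can key its shard index or RNG seed on the group id instead of the rank.

```python
import torch.distributed as dist

def build_video_sync_groups(world_size: int, video_sync_group: int):
    """Partition ranks into groups of size `video_sync_group`; ranks in the same
    group are meant to receive the same input sample. Hypothetical helper."""
    groups, my_group_id = [], None
    rank = dist.get_rank()
    for gid, start in enumerate(range(0, world_size, video_sync_group)):
        ranks = list(range(start, start + video_sync_group))
        group = dist.new_group(ranks)   # every rank must participate in every new_group call
        groups.append(group)
        if rank in ranks:
            my_group_id = gid
    return groups, my_group_id

# A data sampler can then seed on my_group_id rather than on rank, so the 8 ranks of one
# group (with world_size=8 and video_sync_group=8, a single group) see identical batches.
```

And for answer 4, a quick numerical check of the variance argument (illustration only): bilinear 2x downsampling averages roughly four i.i.d. unit-variance pixels, which halves the standard deviation, and multiplying by 2 restores an approximately standard Gaussian.

```python
import torch
import torch.nn.functional as F

noise = torch.randn(16, 3, 256, 256)
down = F.interpolate(noise, scale_factor=0.5, mode='bilinear', align_corners=False)

print(noise.std().item())        # ~1.0
print(down.std().item())         # ~0.5  (variance dropped to ~1/4)
print((down * 2).std().item())   # ~1.0 again, i.e. back to unit variance
```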
yjhong89 commented 23 hours ago

Thanks!

yjhong89 commented 22 hours ago

Another question:

feifeiobama commented 21 hours ago

Theoretically, it naturally performs I2V training during autoregressive training (since the first frame is an image). However, we have not explicitly optimized for I2V, so the performance may be suboptimal. We are working on some improvements and will share them in due time.

yjhong89 commented 21 hours ago

Yes, that sounds right. Autoregressive training naturally performs I2V training.

Another question:

feifeiobama commented 21 hours ago

Great observation! Please refer to https://github.com/jy0205/Pyramid-Flow/issues/28#issuecomment-2406892327.

yjhong89 commented 19 hours ago

Thanks for the quick answer!