jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

Training without Sequence Parallelism but VIDEO_SYNC_GROUP #162

Open rob-hen opened 2 weeks ago

rob-hen commented 2 weeks ago

Hi all,

the provided script train_pyramid_flow.sh does not set the flag use_sequence_parallel. In that case, what is the purpose of setting VIDEO_SYNC_GROUP=8? Why would we want all workers to use the same video?

jy0205 commented 2 weeks ago

Hi, we do not use sequence parallelism during training. VIDEO_SYNC_GROUP controls the number of processes that receive the same video batch as input. We find that this trick makes the gradient direction more stable (it optimizes over the whole latent sequence of one video, rather than over latents from different videos).
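
For illustration, here is a minimal sketch of the idea (not taken from the repository; the names pick_video_index, sync_group_size, etc. are assumptions): ranks inside the same sync group derive an identical sampling seed, so they all draw the same video index, while different groups still see different videos.

```python
import random

def pick_video_index(rank: int, sync_group_size: int, num_videos: int, epoch: int) -> int:
    # All ranks inside the same sync group share a group id ...
    group_id = rank // sync_group_size
    # ... and therefore the same RNG seed, hence the same video index.
    rng = random.Random(hash((epoch, group_id)))
    return rng.randrange(num_videos)

# With 8 GPUs and VIDEO_SYNC_GROUP=8, every rank draws the same video.
print([pick_video_index(r, sync_group_size=8, num_videos=1000, epoch=0) for r in range(8)])
```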

rob-hen commented 2 weeks ago

Hi @jy0205,

thank you for the answer. So with VIDEO_SYNC_GROUP=8 and GPUS=8, all GPUs get exactly the same videos. However, I don't see any difference between the processes: they will all use exactly the same latent (the same clip from the video): https://github.com/jy0205/Pyramid-Flow/blob/e4b02ef31edba13e509896388b1fedd502ea767c/dataset/dataset_cls.py#L192 .

yjhong89 commented 2 weeks ago

I think video_sync_group doesn't split the same video latent; rather, each process accepts the same video latent without splitting.

  • This part is different from sequence parallelism, which splits the latent along the time axis.
  • Is that right?

jy0205 commented 2 weeks ago

Yes, you are right. The video_sync_group does not split the video. It works because different video ranks load different video lengths; you can find this in the sample_length method.
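
To illustrate the idea only (the actual sample_length implementation in dataset_cls.py may differ; the signature and the candidate lengths below are assumptions): ranks inside one sync group share the same video but map their position in the group to different temporal lengths, so together they cover the whole latent sequence of that video.

```python
def sample_length(rank: int, sync_group_size: int, candidate_lengths: list[int]) -> int:
    # The rank's position inside its sync group decides which clip length it trains on.
    local_idx = rank % sync_group_size
    return candidate_lengths[local_idx % len(candidate_lengths)]

# With 8 ranks in one group and some assumed candidate lengths (in latent frames),
# each rank gets a different clip length from the same shared video.
lengths = [sample_length(r, 8, [1, 3, 5, 7, 9, 11, 13, 16]) for r in range(8)]
print(lengths)
```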

jy0205 commented 2 weeks ago

All the stages use uniform sampling. We make the video token sequences length-balanced, i.e. the total token length per batch is kept fixed.
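
As a rough sketch of what length-balancing can mean (assumed numbers and helper name, not the repo's actual logic): the per-rank batch size is scaled inversely with the sampled clip length, so every rank processes roughly the same number of tokens per step.

```python
def batch_size_for_budget(token_budget: int, num_frames: int, tokens_per_frame: int) -> int:
    # Shorter clips get more samples per batch, longer clips fewer, so that
    # batch_size * num_frames * tokens_per_frame stays close to token_budget.
    return max(1, token_budget // (num_frames * tokens_per_frame))

# Example: a fixed budget of 16384 tokens and 256 tokens per latent frame.
for frames in (1, 4, 8, 16):
    print(frames, batch_size_for_budget(16384, frames, 256))
# -> 1 frame: 64 samples, 4 frames: 16, 8 frames: 8, 16 frames: 4
```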