jy0205 / Pyramid-Flow

Code of Pyramidal Flow Matching for Efficient Video Generative Modeling
https://pyramid-flow.github.io/
MIT License

Training without Sequence Parallelism but VIDEO_SYNC_GROUP #162

Open rob-hen opened 2 weeks ago

rob-hen commented 2 weeks ago

Hi all,

the provided script train_pyramid_flow.sh does not set the flag use_sequence_parallel. In that case, what is the purpose of setting VIDEO_SYNC_GROUP=8? Why would we want all workers to use the same video?

jy0205 commented 2 weeks ago

Hi, we do not use sequence parallelism during training. VIDEO_SYNC_GROUP controls the number of processes that receive the same video batch as input. We find that this trick makes the gradient direction more stable (it optimizes over the whole latent sequence of one video, rather than over latents from different videos).
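
For illustration, here is a minimal sketch of the idea (not taken from the repository; the names pick_video_index, sync_group_size, etc. are assumptions): ranks inside the same sync group derive an identical sampling seed, so they all draw the same video index, while different groups still see different videos.

```python
import random

def pick_video_index(rank: int, sync_group_size: int, num_videos: int, epoch: int) -> int:
    # All ranks inside the same sync group share a group id ...
    group_id = rank // sync_group_size
    # ... and therefore the same RNG seed, hence the same video index.
    rng = random.Random(hash((epoch, group_id)))
    return rng.randrange(num_videos)

# With 8 GPUs and VIDEO_SYNC_GROUP=8, every rank draws the same video.
print([pick_video_index(r, sync_group_size=8, num_videos=1000, epoch=0) for r in range(8)])
```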

rob-hen commented 2 weeks ago

Hi @jy0205,

thank you for the answer. So with VIDEO_SYNC_GROUP=8 and GPUS=8, all GPUs get exactly the same videos. However, I don't see any difference between the processes: they will all use exactly the same latent (the same clip from the video): https://github.com/jy0205/Pyramid-Flow/blob/e4b02ef31edba13e509896388b1fedd502ea767c/dataset/dataset_cls.py#L192 .

yjhong89 commented 2 weeks ago

I think video_sync_group doesn't split the same video latent; rather, each process accepts the same video latent without splitting.

  • This part is different from sequence parallelism, which splits the latent along the time axis.
  • Is that right?

jy0205 commented 2 weeks ago

Yes, you are right. The video_sync_group does not split the video. It works because different video ranks load different video lengths; you can find this in the sample_length method.
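
To illustrate the idea only (the actual sample_length implementation in dataset_cls.py may differ; the signature and the candidate lengths below are assumptions): ranks inside one sync group share the same video but map their position in the group to different temporal lengths, so together they cover the whole latent sequence of that video.

```python
def sample_length(rank: int, sync_group_size: int, candidate_lengths: list[int]) -> int:
    # The rank's position inside its sync group decides which clip length it trains on.
    local_idx = rank % sync_group_size
    return candidate_lengths[local_idx % len(candidate_lengths)]

# With 8 ranks in one group and some assumed candidate lengths (in latent frames),
# each rank gets a different clip length from the same shared video.
lengths = [sample_length(r, 8, [1, 3, 5, 7, 9, 11, 13, 16]) for r in range(8)]
print(lengths)
```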

jy0205 commented 2 weeks ago

All the stages use uniform sampling. We make the video token sequences length-balanced, i.e. the total token length per batch is kept fixed.
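
As a rough sketch of what length-balancing can mean (assumed numbers and helper name, not the repo's actual logic): the per-rank batch size is scaled inversely with the sampled clip length, so every rank processes roughly the same number of tokens per step.

```python
def batch_size_for_budget(token_budget: int, num_frames: int, tokens_per_frame: int) -> int:
    # Shorter clips get more samples per batch, longer clips fewer, so that
    # batch_size * num_frames * tokens_per_frame stays close to token_budget.
    return max(1, token_budget // (num_frames * tokens_per_frame))

# Example: a fixed budget of 16384 tokens and 256 tokens per latent frame.
for frames in (1, 4, 8, 16):
    print(frames, batch_size_for_budget(16384, frames, 256))
# -> 1 frame: 64 samples, 4 frames: 16, 8 frames: 8, 16 frames: 4
```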