showlab / Tune-A-Video

[ICCV 2023] Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
https://tuneavideo.github.io
Apache License 2.0

In the training step, do you shuffle clips? #31

Closed · HyeonHo99 closed 1 year ago

HyeonHo99 commented 1 year ago

Hi, thank you first of all for the amazing work.

I have a few questions about your training process. (1) Did you fix the number of frames per clip at 24? In every config file the clip length is consistently 24. Does that imply that clip lengths larger or smaller than 24 don't perform as well?

(2) In the training step, do you shuffle the order of frames within a clip? My feeling is that shuffling the frames would be improper, since the temporal attention layers also learn the order of the frames.

Thank you again.

zhangjiewu commented 1 year ago

Hi @HyeonHo99, thank you for your interest in our work. Below are some comments regarding your questions:

  1. We set the number of frames to 24 so that the code can run on a 24GB GPU. Feel free to explore other choices of video length.
  2. In our experiments, we did not shuffle the order of frames. If you want to try video-image co-training, you can disable the temporal components when training on images (see the sketches below).
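
To make the two points concrete, here is a minimal sketch of sequential clip sampling, assuming a decoded video tensor of shape (frames, channels, height, width). The names `sample_clip`, `n_sample_frames`, `start_idx`, and `frame_rate` are illustrative, not the repository's actual dataset code. Frames are drawn at a fixed stride in their original order, so the temporal ordering that the attention layers rely on is preserved:

```python
import torch

def sample_clip(video: torch.Tensor, n_sample_frames: int = 24,
                start_idx: int = 0, frame_rate: int = 1) -> torch.Tensor:
    """Take n_sample_frames frames from a decoded video tensor (T, C, H, W),
    keeping their temporal order -- no shuffling."""
    indices = range(start_idx, start_idx + n_sample_frames * frame_rate, frame_rate)
    return video[list(indices)]

# Example: a dummy 48-frame video sampled into an ordered 24-frame clip.
video = torch.randn(48, 3, 64, 64)
clip = sample_clip(video, n_sample_frames=24)
assert clip.shape[0] == 24
```

And here is a toy illustration of what "disabling the temporal components when training on images" could look like: spatial attention always runs, while temporal attention is skipped for single-frame inputs. This is a hypothetical module for illustration only, not the actual Tune-A-Video implementation:

```python
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    """Toy block: spatial attention always runs; temporal attention is
    skipped when the clip has a single frame (i.e. an image)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim)
        b, f, n, d = x.shape
        # spatial attention over tokens within each frame
        xs = x.reshape(b * f, n, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(b, f, n, d)
        if f > 1:  # disable the temporal component for image (1-frame) batches
            xt = x.permute(0, 2, 1, 3).reshape(b * n, f, d)
            xt, _ = self.temporal_attn(xt, xt, xt)
            x = xt.reshape(b, n, f, d).permute(0, 2, 1, 3)
        return x

# Images can then be treated as 1-frame clips in the same batch pipeline.
block = SpatioTemporalBlock()
clip_batch = torch.randn(2, 24, 16, 64)   # video: 24 frames
image_batch = torch.randn(2, 1, 16, 64)   # image as a 1-frame clip
assert block(clip_batch).shape == clip_batch.shape
assert block(image_batch).shape == image_batch.shape
```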