plai-group / flexible-video-diffusion-modeling

MIT License

Confirm how to generate 1000-frames long video #2

Closed JunyaoHu closed 1 year ago

JunyaoHu commented 1 year ago

As I understand it, for the long-term 1000-frame video prediction in your paper, there are 36 conditional/observed frames and the FDM model predicts the remaining 964 predicted/sampled/latent/future frames. Is that right?

image

Why did you not use all of the conditional/observed frames for prediction? Is it because $K=10$ is set, so only 10 frames can be selected for prediction? And if I want to follow your work and compare results, can I use all 36 frames?

For example, with the autoregressive scheme:

image

wsgharvey commented 1 year ago

Yes, you're right that in this experiment we are predicting 964 frames given the previous 36.

Our architecture has memory requirements proportional to (number of frames predicted + number of frames conditioned on). Therefore, to avoid out-of-memory errors, we keep this number at a maximum of 20 (i.e. we predict 10 conditioned on another 10) and therefore cannot use all 36 initial frames. We still allow our baselines (e.g. CWVAE) to condition on all 36 so, if you are able to condition on all 36 frames, I think this is still a reasonable comparison.
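To make the 20-frame budget concrete, here is a minimal sketch of the simplest autoregressive schedule consistent with the numbers above: each step predicts up to 10 new frames conditioned on the 10 most recent frames. The function name `autoregressive_schedule` and its parameters are hypothetical and are not taken from this repository; the actual code may select conditioning frames differently.

```python
def autoregressive_schedule(total_frames=1000, n_observed=36, n_pred=10, n_cond=10):
    """Return a list of (conditioned_indices, predicted_indices) pairs.

    Hypothetical helper for illustration only: it assumes a plain
    autoregressive scheme where each step conditions on the most recent
    n_cond frames and predicts the next n_pred frames.
    """
    steps = []
    n_done = n_observed  # frames 0 .. n_done-1 are observed or already sampled
    while n_done < total_frames:
        # Predict the next block of frames (the final block may be shorter).
        pred = list(range(n_done, min(n_done + n_pred, total_frames)))
        # Condition on the most recent frames only, to respect the memory budget.
        cond = list(range(max(0, n_done - n_cond), n_done))
        steps.append((cond, pred))
        n_done += len(pred)
    return steps


steps = autoregressive_schedule()
n_steps = len(steps)
max_window = max(len(cond) + len(pred) for cond, pred in steps)
n_generated = sum(len(pred) for _, pred in steps)
print(n_steps, max_window, n_generated)
```

Under these assumptions the first step conditions on frames 26 to 35 (the last 10 of the 36 observed frames) and predicts frames 36 to 45, so the remaining 26 observed frames are never passed to the model directly.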

JunyaoHu commented 1 year ago

thanks a lot!

JunyaoHu commented 1 year ago

btw, how long does it take to predict a 1000-frame video?

wsgharvey commented 1 year ago

On an A5000 GPU, it takes about 2 minutes to generate 10 frames, so roughly 3.2 hours to generate 964 frames.
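The arithmetic behind that estimate can be checked with a short sketch. The variable names are illustrative, and the assumption that every sampling step costs the same 2 minutes (even a final, shorter step) is mine rather than something stated in this thread.

```python
import math

frames_to_generate = 1000 - 36   # 964 frames remain after the 36 observed ones
frames_per_step = 10             # frames predicted per sampling step
minutes_per_step = 2.0           # approximate A5000 timing quoted above

# Pro-rated estimate: treat cost as linear in the number of frames.
hours_prorated = frames_to_generate / frames_per_step * minutes_per_step / 60

# Whole-step estimate: assume the final partial step costs a full 2 minutes.
n_steps = math.ceil(frames_to_generate / frames_per_step)
hours_whole_steps = n_steps * minutes_per_step / 60

print(f"{hours_prorated:.2f} h pro-rated, {hours_whole_steps:.2f} h over {n_steps} steps")
```

Both estimates round to about 3.2 hours, matching the figure quoted above.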