Closed JunyaoHu closed 1 year ago
Yes, you're right that in this experiment we are predicting 964 frames given the previous 36.
Our architecture's memory requirements are proportional to (number of frames predicted + number of frames conditioned on). To avoid out-of-memory errors, we cap this number at 20 (i.e. we predict 10 frames conditioned on another 10), so we cannot use all 36 initial frames. We still allow our baselines (e.g. CWVAE) to condition on all 36, so if you are able to condition on all 36 frames, I think this is still a reasonable comparison.
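To make the 20-frame budget concrete, here is a rough sketch of a plain autoregressive schedule: each step conditions on the 10 most recent known frames and predicts the next 10. This is only an illustration under that assumption, not necessarily the sampling scheme used in the paper, and the function name and parameters are hypothetical.

```python
def autoregressive_schedule(total_frames=1000, observed=36, n_cond=10, n_pred=10):
    """Return a list of (conditioning_indices, predicted_indices), one per step.

    Hypothetical helper: assumes a simple sliding window in which every step
    conditions on the latest n_cond known frames and predicts the next n_pred.
    """
    steps = []
    known = observed  # frames 0 .. known-1 are already available
    while known < total_frames:
        cond = list(range(known - n_cond, known))
        pred = list(range(known, min(known + n_pred, total_frames)))
        steps.append((cond, pred))
        known += len(pred)
    return steps


steps = autoregressive_schedule()
print(len(steps))                        # number of sampling steps
print(steps[0][0][0], steps[0][0][-1])   # first step conditions on frames 26..35
print(steps[-1][1])                      # last step predicts the remaining frames
```

With these defaults, only frames 26 to 35 of the 36 observed frames are ever used for conditioning, and no step touches more than 20 frames at once.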
Thanks a lot!
By the way, how long did it take to predict a 1000-frame video?
On an A5000 GPU, it takes about 2 minutes to generate 10 frames, so roughly 3.2 hours to generate 964 frames.
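The arithmetic behind that estimate, as a quick sketch. It assumes every step generates 10 frames and takes about 2 minutes, with the final partial step costing the same as a full one; the function name is hypothetical.

```python
import math


def estimate_hours(frames_to_predict=964, frames_per_step=10, minutes_per_step=2.0):
    """Estimate sampling steps and wall-clock hours from the figures quoted above."""
    # A partial final step still costs a full step, so round up.
    steps = math.ceil(frames_to_predict / frames_per_step)
    return steps, steps * minutes_per_step / 60


steps_needed, hours = estimate_hours()
print(steps_needed, round(hours, 2))  # 97 3.23
```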
As I understand it, for long-term 1000-frame video prediction in your paper, you have 36 conditional/observed frames and the FDM model predicts 964 predicted/sampled/latent/future frames. Is that right?
Why didn't you use all of the conditional/observed frames for prediction? Is it because $K=10$, so only 10 frames can be selected for conditioning? And if I want to follow your work and compare results, can I use all 36 frames?
For example, with autoregression: