lucidrains / phenaki-pytorch

Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch

video preprocessing #7

Open 9B8DY6 opened 1 year ago

9B8DY6 commented 1 year ago

In the Phenaki paper, they downsample the MiT dataset from 25 fps to 6 fps before video quantization.

I wonder how the downsampled video is obtained during preprocessing, and whether the input video is also downsampled when training the transformer and at video generation inference. Even if you don't plan to upload the training and dataloader code for video, I would appreciate any advice, since you must have tried implementing this.
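For reference, this is the kind of preprocessing I have in mind: a rough sketch that simply drops frames to go from 25 fps to roughly 6 fps. The tensor layout and names are my own assumptions, not code from this repo.

```python
# minimal sketch of temporal downsampling by frame dropping (my own assumption,
# not code from phenaki-pytorch); expects `video` shaped (channels, frames, height, width)
import torch

def downsample_fps(video: torch.Tensor, src_fps: float = 25., target_fps: float = 6.) -> torch.Tensor:
    c, f, h, w = video.shape
    duration = f / src_fps                                # clip length in seconds
    num_out = max(int(duration * target_fps), 1)          # number of frames to keep
    indices = torch.linspace(0, f - 1, num_out).long()    # evenly spaced frame indices
    return video[:, indices]

# example: a 2 second clip at 25 fps becomes 12 frames at ~6 fps
video = torch.randn(3, 50, 128, 128)
print(downsample_fps(video).shape)  # torch.Size([3, 12, 128, 128])
```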

One more question: I have trained your c-vivit code for reconstruction. After I got reasonable outputs, the very next checkpoint produced bad results like the comparison below, where the left is the ground truth and the right is the output. (I set the checkpoint interval to 3000.)
[image: ground-truth frame next to the degraded reconstruction]

Could I ask what might be going wrong, and whether early stopping is expected to be necessary when training the tokenizer?
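To make the early stopping question concrete, this is the kind of thing I am imagining: keep only the checkpoint with the lowest validation reconstruction loss instead of the latest one. Everything here (the model call, the loss, the path) is a placeholder sketch for my own code, not code from this repo.

```python
# sketch of best-checkpoint selection as a form of early stopping for the tokenizer.
# `model` is assumed to return reconstructions when called on videos (placeholder assumption)
import torch
import torch.nn.functional as F

def save_if_best(model, val_videos, best_loss, path = './cvivit.best.pt'):
    """evaluate reconstruction on a held-out batch and save only when it improves"""
    model.eval()
    with torch.no_grad():
        recon = model(val_videos)                        # placeholder forward pass
        val_loss = F.mse_loss(recon, val_videos).item()
    model.train()

    if val_loss < best_loss:
        torch.save(model.state_dict(), path)             # keep only the best weights
        return val_loss
    return best_loss
```

I would call this every 3000 steps in place of the unconditional checkpoint save, so a later collapse does not overwrite a good tokenizer.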

Thank you.

lucidrains commented 1 year ago

@9B8DY6 the transformer trains on the quantized representation from the c-vivit, so the frame rate is the same. it is fine if the video is downsampled temporally, as we've seen from numerous papers that temporal upsampling (interpolation) works just fine
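for example, the simplest kind of temporal upsampling is just interpolation along the frame axis, something like the sketch below (plain torch, not code from this repo)

```python
# minimal sketch of temporal upsampling by interpolation along the frame axis.
# expects `video` shaped (batch, channels, frames, height, width); only the frame axis is scaled
import torch
import torch.nn.functional as F

def upsample_fps(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    return F.interpolate(video, scale_factor = (factor, 1, 1), mode = 'trilinear', align_corners = False)

video = torch.randn(1, 3, 6, 128, 128)         # e.g. one second of video at 6 fps
print(upsample_fps(video, factor = 4).shape)   # torch.Size([1, 3, 24, 128, 128])
```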

yea, i'll get some training code down soon for phenaki, as there are a lot of details that are required for stable attention net training (as well as automating the entire adversarial training portion, which may be too complicated for the uninitiated)

lucidrains commented 1 year ago

@9B8DY6 in yesterday's demo they were doing the upsampling with a ddpm. this can be done with imagen-pytorch too, once i get the logic for temporal upsampling in place