lucidrains / phenaki-pytorch

Implementation of Phenaki Video, which uses Mask GIT to produce text guided videos of up to 2 minutes in length, in Pytorch

video preprocessing #7

Open 9B8DY6 opened 1 year ago

9B8DY6 commented 1 year ago

In the Phenaki paper, they downsample the MiT dataset from 25 fps to 6 fps before video quantization.

I wonder how the downsampled video is obtained during preprocessing, and whether the input video is also downsampled when training the transformer and at video generation inference. Even if you don't plan to upload the training and dataloader code for video, I would appreciate any advice, since you must have tried implementing this.
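For reference, this is the kind of preprocessing I have in mind: a rough sketch that simply drops frames to go from 25 fps to roughly 6 fps. The tensor layout and names are my own assumptions, not code from this repo.

```python
# minimal sketch of temporal downsampling by frame dropping (my own assumption,
# not code from phenaki-pytorch); expects `video` shaped (channels, frames, height, width)
import torch

def downsample_fps(video: torch.Tensor, src_fps: float = 25., target_fps: float = 6.) -> torch.Tensor:
    c, f, h, w = video.shape
    duration = f / src_fps                                # clip length in seconds
    num_out = max(int(duration * target_fps), 1)          # number of frames to keep
    indices = torch.linspace(0, f - 1, num_out).long()    # evenly spaced frame indices
    return video[:, indices]

# example: a 2 second clip at 25 fps becomes 12 frames at ~6 fps
video = torch.randn(3, 50, 128, 128)
print(downsample_fps(video).shape)  # torch.Size([3, 12, 128, 128])
```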

One more question: I have trained your c-vivit code for reconstruction. After I got reasonable outputs, the very next checkpoint produced bad results like the comparison below, where the left is the ground truth and the right is the output. (I set the checkpoint interval to 3000.)
[image: ground-truth frame next to the degraded reconstruction]

Could I ask what might be going wrong, and whether early stopping is expected to be necessary when training the tokenizer?
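To make the early stopping question concrete, this is the kind of thing I am imagining: keep only the checkpoint with the lowest validation reconstruction loss instead of the latest one. Everything here (the model call, the loss, the path) is a placeholder sketch for my own code, not code from this repo.

```python
# sketch of best-checkpoint selection as a form of early stopping for the tokenizer.
# `model` is assumed to return reconstructions when called on videos (placeholder assumption)
import torch
import torch.nn.functional as F

def save_if_best(model, val_videos, best_loss, path = './cvivit.best.pt'):
    """evaluate reconstruction on a held-out batch and save only when it improves"""
    model.eval()
    with torch.no_grad():
        recon = model(val_videos)                        # placeholder forward pass
        val_loss = F.mse_loss(recon, val_videos).item()
    model.train()

    if val_loss < best_loss:
        torch.save(model.state_dict(), path)             # keep only the best weights
        return val_loss
    return best_loss
```

I would call this every 3000 steps in place of the unconditional checkpoint save, so a later collapse does not overwrite a good tokenizer.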

Thank you.

lucidrains commented 1 year ago

@9B8DY6 the transformer trains on the quantized representation from the c-vivit, so the frame rate is the same. it is fine if the video is downsampled temporally, as we've seen from numerous papers that temporal upsampling (interpolation) works just fine
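for example, the simplest kind of temporal upsampling is just interpolation along the frame axis, something like the sketch below (plain torch, not code from this repo)

```python
# minimal sketch of temporal upsampling by interpolation along the frame axis.
# expects `video` shaped (batch, channels, frames, height, width); only the frame axis is scaled
import torch
import torch.nn.functional as F

def upsample_fps(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    return F.interpolate(video, scale_factor = (factor, 1, 1), mode = 'trilinear', align_corners = False)

video = torch.randn(1, 3, 6, 128, 128)         # e.g. one second of video at 6 fps
print(upsample_fps(video, factor = 4).shape)   # torch.Size([1, 3, 24, 128, 128])
```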

yea, i'll get some training code down soon for phenaki, as there are a lot of details that are required for stable attention net training (as well as automating the entire adversarial training portion, which may be too complicated for the uninitiated)

lucidrains commented 1 year ago

@9B8DY6 in yesterday's demo they were doing the upsampling with a ddpm. this can be done with imagen-pytorch too, once i get the logic for temporal upsampling in place