How to use this model for inference

Hello!

Thanks for the great implementation. I was wondering how to use this code for actual future-frame video prediction. Suppose I have pretrained the vqvae to compress videos of shape (16, 3, 256, 256) and trained a pixel_snail model on the compressed latents. Now, if I have a video of shape (4, 3, 256, 256), what should I do at inference time to predict its future frames? I am still a bit confused even after reading the paper.

Thanks.