[Documentation] Video Prediction Labeled as a V2V process, despite taking only 1 frame

Sazoji commented 3 years ago

Judging by the results, the transformer is taking in a single frame, and would be considered an Image to Video process. Something like video inpainting or camera FOV extrapolation(like in FGVC) would be input video -> output video. Am I missing something in the documentation that maybe shows it as some sort of sparse video interpolation where it can input more than a (D1, D2, single frame); or was it called V2V in order to match the I2I label on the inpainting/image completion counterparts?

Additionally, there isn't a direct link to the paper, which documents that the V2V model only takes in a single image. https://arxiv.org/abs/2111.12417

chenfei-wu commented 3 years ago

We view image as a speical video with one frame. As a result, image-to-video generation can viewed as a special case of video-to-video generation.

Sazoji commented 3 years ago

Ok, I'll agree that frame to video can be seen as a special case of V2V generation. I was going to close this yesterday, but GitHub was down during my break. I'd just like to mention that this method is not the type of V2V usage one would be looking for when trying to do video completion or inpainting, which seemed to be implied when put below image completion.

An actual example of V2V synthesis would be a doman change or style transfer, like a video label encoder -> photorealistic video decoder, NUWA-Infinity seems to have the capacity to change style via a conditioned decoder, and properly labels the synthetic models as video prediction and generation based on what's encoded (images and text, IE not video). Would still like to see how video encoders could be implemented.

microsoft / NUWA

[Documentation] Video Prediction Labeled as a V2V process, despite taking only 1 frame #1