questions on double_size and some blurred results

zengyh1900 commented 5 years ago

Dear authors,

I'm studying your paper and code, thanks for sharing! I have a question: as I understand from your code and paper, 0 indicates non-hole pixels and 1 indicates hole pixels. Why, then, do you need to multiply the masks by 0.5 when the input size is 512 (see demo_vi.py)?

Also, I have reproduced results similar to some of the cases shown in your paper. However, some cases are very blurred (especially slow-moving ones). This is an example from DAVIS (DAVIS/bear). Is this expected, or did I miss something needed to get better results?

Looking forward to your reply.

ytongW commented 4 years ago

Could you tell me how to change the size of the output image? When I changed the size of the input image directly, I got this error:

RuntimeError: invalid argument 0: Sizes of tensors must match except in dimension 1. Got 32 and 64 in dimension 3 at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCTensorMath.cu:111

If I resize the output image instead, the image gets very blurred. Thanks for your time!

AjithPanja commented 3 years ago

Yeah, I also noticed the blurriness while running the code on the bear video. I would be really grateful if you could clarify my doubt 😅. From my understanding, pixels that are known in previous and future frames are used to fill the hole, but how are blind-spot pixels filled? (E.g., a trashcan stays in the same place throughout the video. If the trashcan has to be removed, how will its pixels be filled?)

mcahny commented 3 years ago

Hi all, thanks for your interest. To answer your questions:

The double_size case was trained with masks in which the hole region is filled with the value 0.5 and non-hole regions with 1.0. There is no special reason behind this choice.
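For illustration only, here is a minimal sketch of the mask values described above. It assumes the input is a binary mask with 1 = hole and 0 = non-hole, as in the original question. The helper `encode_mask` and its parameters are hypothetical and are not taken from demo_vi.py, which (per the question) applies a multiplication by 0.5 instead.

```python
def encode_mask(binary_mask, hole_value=0.5, non_hole_value=1.0):
    """Map a binary mask to the mask values described for double_size.

    Assumption: binary_mask is a 2D list with 1 = hole, 0 = non-hole.
    This is an illustrative helper, not code from the repository.
    """
    return [
        [hole_value if pixel == 1 else non_hole_value for pixel in row]
        for row in binary_mask
    ]


binary_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]

encoded = encode_mask(binary_mask)
print(encoded[1])  # [1.0, 0.5, 0.5, 1.0]
```

Whatever encoding is used at test time has to match the one used in training, so the exact values should be checked against the repository before relying on this sketch.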

About the fixed-size hole: your results look reasonable, and I can reproduce them on the bear video. My understanding of this result is based on the following points:

VINet can be divided into 1) an image-level encoder-decoder network, and 2) additional reference encoders that support the target frame inpainting.
Part 1 performs standard image inpainting and is supposed to be able to "hallucinate" content for regions that are never visible.
Part 2 performs "copy-and-paste" from neighboring frames onto the hole region of the target frame.
While VINet is supposed to be good at both, the empirical results imply that training did not balance the two well and focused mainly on "copy-and-paste" learning. This would have led to poor "hallucination" performance and thus blurry results with fixed holes.
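To make the "never visible" case concrete, here is a small sketch that marks the pixels that are holes in every frame. It assumes the frames are already aligned (no camera or object motion) and that masks use 1 = hole, 0 = visible. The function `never_visible` is a hypothetical helper for illustration and is not part of VINet, which aligns neighboring frames before aggregating them.

```python
def never_visible(hole_masks):
    """Return a mask of pixels that are holes in every frame.

    hole_masks: list of per-frame 2D lists, 1 = hole, 0 = visible.
    Assumption: frames are pre-aligned, so the same (y, x) index
    refers to the same scene point in every frame.
    """
    height = len(hole_masks[0])
    width = len(hole_masks[0][0])
    return [
        [int(all(mask[y][x] == 1 for mask in hole_masks)) for x in range(width)]
        for y in range(height)
    ]


# Moving hole: shifts one pixel to the right in each frame.
moving = [
    [[1, 1, 0, 0]],
    [[0, 1, 1, 0]],
    [[0, 0, 1, 1]],
]

# Fixed hole: the same pixels are masked in every frame.
fixed = [
    [[0, 1, 1, 0]],
    [[0, 1, 1, 0]],
    [[0, 1, 1, 0]],
]

print(never_visible(moving))  # [[0, 0, 0, 0]]
print(never_visible(fixed))   # [[0, 1, 1, 0]]
```

With the moving hole, every pixel is visible in at least one frame, so "copy-and-paste" can in principle recover it. With the fixed hole in a static scene, the masked pixels are never observed, so they can only be filled by the image-level "hallucination" path.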

mcahny / Deep-Video-Inpainting

questions on double_size and some blurred results #17