kxhit / EscherNet

[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis
https://kxhit.github.io/EscherNet

Some questions about image encoder and reference images #18

Open zhanghaoyu816 opened 1 week ago

zhanghaoyu816 commented 1 week ago

Thanks for your nice work. I'm still confused by the choice of ConvNeXtV2 as the image encoder in this project. It is mentioned that ConvNeXtV2 is used because a frozen CLIP can only accept one reference image and only extracts high-level semantic features.

I want to know:

  1. IP-Adapter, for instance, also uses a pretrained CLIP as its image encoder, yet it can accept multiple images as conditions. So how should I understand the claim that ConvNeXtV2 can adapt to multiple reference images? (In the code, the multiple images appear to be stacked into a single tensor of shape [N, C, H, W]; why can't CLIP use a similar approach? See the small sketch after this list for what I mean.)

  2. Where does the conclusion come from that ConvNeXtV2, unlike CLIP, can extract both high-level and low-level features? Or is the rationale for choosing ConvNeXtV2 simply that it is lightweight enough to be fine-tuned during training?

  3. When multiple reference views are encoded as encoder hidden states and injected through cross-attention to promote reference-to-target consistency, does every reference view need to share a field of view with the target view? This is easy to satisfy in object-level generation, but in scene-level generation the wider range of camera motion means that not every reference view shares a field of view with a given target view, so some references may provide no useful information for generating it. Does EscherNet's approach of promoting reference-to-target consistency through cross-attention still work in such scenarios? Or do you have any suggestions?
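
For reference, here is a minimal sketch of what I mean by "a similar approach" in question 1: purely shape-wise, the Hugging Face CLIP vision tower already accepts a batch of N images, so the same [N, C, H, W] treatment seems possible. (The checkpoint name below is just an example, not necessarily the one used in this repo.)

```python
import torch
from transformers import CLIPVisionModel

# Example only: any CLIP vision checkpoint would do for this shape check.
clip = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

refs = torch.randn(3, 3, 224, 224)  # N = 3 reference views, stacked as [N, C, H, W]
with torch.no_grad():
    tokens = clip(pixel_values=refs).last_hidden_state

print(tokens.shape)  # [3, 257, 1024]: 256 patch tokens + 1 CLS token per view
```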

Looking forward to your reply.

lorenmt commented 1 week ago
  1. CLIP can accept multiple reference images, but since CLIP is trained to extract semantics, a frozen CLIP mainly produces semantic information. That is the main reason Zero-1-to-3 and other methods that rely on a single input image also concatenate the reference image along the input channel dimension, as a way to recover the low-level information that CLIP misses. In EscherNet we handle multiple reference images, so we cannot concatenate them all in the input channels; instead we fine-tune the image encoder to hopefully mitigate the issue (see the first sketch after this list). But fine-tuning CLIP directly is too expensive, so we eventually swapped it for ConvNeXtV2.
  2. It is mainly about finding a lightweight encoder. We are a GPU-poor lab.
  3. Good question. On a wider 6DoF camera trajectory, it is possible that some reference images are completely useless for generating a given novel view. Ideally we would keep only the K closest views (computed via the intersected field of view), which would also improve inference efficiency. But the intersected field of view is probably not straightforward to implement. We consider it important future work, and do let us know if you have a quick solution (a rough heuristic is sketched below).
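
To make point 1 concrete, here is a minimal sketch (not the actual EscherNet code) contrasting the two conditioning routes: channel concatenation of a single reference, as in Zero-1-to-3, versus batching N references through a trainable image encoder and feeding the resulting tokens to cross-attention. The small conv stack is only a stand-in for ConvNeXtV2, and all shapes are illustrative.

```python
import torch
import torch.nn as nn

# Route A (Zero-1-to-3 style, single reference): concatenate the encoded
# reference with the noisy latent along the channel dimension, so the UNet
# sees the low-level detail that a frozen CLIP embedding would miss.
noisy_latent = torch.randn(1, 4, 32, 32)                   # [B, 4, h, w]
ref_latent = torch.randn(1, 4, 32, 32)                     # single encoded reference
unet_input = torch.cat([noisy_latent, ref_latent], dim=1)  # [B, 8, h, w]

# Route B (multi-reference): channel concat does not scale with N references,
# so instead encode the N references as a batch with a trainable encoder and
# pass the flattened feature tokens to cross-attention.
N = 3
refs = torch.randn(N, 3, 256, 256)                         # [N, 3, H, W]

encoder = nn.Sequential(                                   # stand-in for ConvNeXtV2
    nn.Conv2d(3, 64, kernel_size=4, stride=4), nn.GELU(),
    nn.Conv2d(64, 768, kernel_size=8, stride=8),
)
feat = encoder(refs)                                       # [N, 768, 8, 8]
tokens = feat.flatten(2).transpose(1, 2)                   # [N, 64, 768] tokens per view
cond = tokens.reshape(1, -1, 768)                          # [1, N*64, 768] for cross-attn

print(unet_input.shape, cond.shape)
```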
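
And for point 3, a very rough heuristic one could try as a quick solution: rank the reference views by how closely their viewing directions align with the target's and keep the top K. This is only a cheap proxy, not the intersected-field-of-view test described above; the pose convention (camera-to-world matrices, forward axis in the third rotation column) is an assumption and depends on the dataset.

```python
import numpy as np

def topk_reference_views(target_c2w, ref_c2ws, k=3):
    """Rank reference views by the angle between their camera forward axes
    and the target's forward axis, and return the indices of the k closest.
    Only a crude proxy for the intersected field of view."""
    t_dir = target_c2w[:3, 2]                    # target viewing direction
    r_dirs = ref_c2ws[:, :3, 2]                  # [N, 3] reference directions
    cos = r_dirs @ t_dir / (
        np.linalg.norm(r_dirs, axis=1) * np.linalg.norm(t_dir) + 1e-8
    )
    return np.argsort(-cos)[:k]                  # k best-aligned references

# Toy example with random 4x4 poses (rotations are not orthonormal here;
# this only demonstrates the selection logic).
refs = np.random.randn(10, 4, 4)
target = np.random.randn(4, 4)
print(topk_reference_views(target, refs, k=3))
```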