kxhit / EscherNet

[CVPR2024 Oral] EscherNet: A Generative Model for Scalable View Synthesis
https://kxhit.github.io/EscherNet

Performance Degradation with T_out set to 2 #14

Open Lizb6626 opened 2 months ago

Lizb6626 commented 2 months ago

I reduced the output views to 2 in the demo case and observed a significant performance drop. I wonder why performance excels with 25 views but suffers with 2. Could you provide insights or an explanation for this behavior? (I use the 4-DoF model.)

[input image]

[output image]

[ground-truth image]

kxhit commented 2 months ago

Hi @Lizb6626 Excellent question! We also observed this behavior and reported it in Appendix Fig.8. Apparently, the number of repeated output views affects the generation, and the effect seems to saturate at around 15 views. The short answer is we don't know exactly why.

Some guesses are:

  1. The model is trained only on a 3-to-3 training set and somehow overfits to the 3-to-3 setting, so 2 outputs are slightly OOD. A better training strategy might be to use a flexible number of view pairs instead of a fixed 3-to-3 split.
  2. During the diffusion process, the output views can talk to each other via self-attention, which somehow stabilizes the generation process and makes the results more consistent.
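The second guess can be pictured with a toy version of cross-view self-attention. This is a minimal sketch, not EscherNet's actual implementation: tokens from all output views are flattened into one sequence before attention, so every view can attend to every other view.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sketch: merge all T_out target views into one token sequence
# so the views exchange information inside a single self-attention pass.
def cross_view_self_attention(view_tokens):
    # view_tokens: (T_out, N, C) -- N tokens per view, C channels
    T, N, C = view_tokens.shape
    q = k = v = view_tokens.reshape(T * N, C)  # views share one sequence
    w = softmax(q @ k.T / np.sqrt(C))          # (T*N, T*N) attention weights
    return (w @ v).reshape(T, N, C)            # split back into per-view tokens

x = np.random.randn(3, 16, 32)  # 3 target views, 16 tokens each
y = cross_view_self_attention(x)
print(y.shape)  # (3, 16, 32)
```

With this layout, every token of every output view contributes keys and values for all other views, which is the channel through which repeated views could stabilize each other.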

Happy to discuss!

Lizb6626 commented 2 months ago

Thank you for your reply.

I have doubts about the OOD assumption: the 15-target-view scenario is also OOD, yet the model achieves optimal results there. Or does OOD only occur when there are fewer than 3 target views?

In the appendix, you mention "duplicates with identical target camera poses." I'm curious how this aligns with the idea that "multiple target views reduce inherent stochasticity." Since identical target views offer no additional information, they should not reduce inherent stochasticity. Moreover, because identical target views produce identical Q/K/V values in the self-attention module, attending to one is equivalent to attending to all of them.
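The duplicate-view point can be checked numerically: under softmax attention, tiling identical key/value rows leaves the output unchanged, since each duplicate's weight is rescaled by the same factor as the normalizer. (A toy check, not EscherNet code.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    # Standard scaled dot-product attention
    w = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return w @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))    # 4 query tokens
kv = rng.standard_normal((6, 8))   # 6 key/value tokens

out_once = attend(q, kv, kv)
out_dup = attend(q, np.tile(kv, (8, 1)), np.tile(kv, (8, 1)))  # 8 exact copies

print(np.allclose(out_once, out_dup))  # True
```

So if the duplicated views were truly identical at every step, attention alone could not explain the improvement; any difference must come from elsewhere (e.g., per-view noise, as discussed below).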

I experimented by repeating target views 8 times to yield 16 outputs, but the results did not show significant improvement.


kxhit commented 2 months ago

Hi @Lizb6626 Thanks for your feedback. Your argument makes sense to me, but note that each novel view does not start from the same random noise. Could generation benefit when different noise samples at the same novel pose have to align with each other through attention?
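One toy way to picture this intuition: each duplicated output view starts from independent noise, so if the views can pool information (crudely modeled here as simple averaging, which is only an analogy and not the diffusion model's actual mechanism), the residual noise shrinks roughly as 1/T_out.

```python
import numpy as np

rng = np.random.default_rng(0)
signal = rng.standard_normal(1000)  # stand-in for the "true" novel view

for t_out in [1, 4, 16]:
    # t_out copies of the same view, each with independent noise
    noisy = signal[None, :] + 0.5 * rng.standard_normal((t_out, 1000))
    # pooling across copies (here: a plain mean) reduces the error
    err = np.mean((noisy.mean(axis=0) - signal) ** 2)
    print(t_out, round(float(err), 4))  # error shrinks roughly as 0.25 / t_out
```

This would be consistent with the saturation you observed: the marginal variance reduction from each extra duplicate falls off quickly.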

The results you are showing here are 1-to-16 (2 novel poses), right? Could you try different numbers of input views, e.g., 2-to-2 vs. 2-to-16 (2 novel poses), or 3 or 5 input views? The model generally works much better with >= 2 input views. And let me know whether, with a fixed number of input views but more repeated output views, the generation improves.