Open Lizb6626 opened 3 months ago
Hi @Lizb6626 Excellent question! We also observed this behavior and reported it in Appendix Fig. 8. The number of repeated output views does affect generation quality, and the effect seems to saturate at around 15. The short answer is that we don't know exactly why.
Some guesses are:
Happy to discuss!
Thank you for your reply.
I have doubts about the OOD assumption: in the 15-target-view scenario, where OOD is present, the model achieves its best results. Or does OOD only occur when there are fewer than 3 target views?
In the appendix, you mentioned "duplicates with identical target camera poses." I'm interested in how this aligns with the idea that "multiple target views reduce inherent stochasticity". Identical target views offer no additional information, so they may not reduce inherent stochasticity. Additionally, since identical target views have similar QKV values in the self-attention module, attending to one of them should be equivalent to attending to all of them.
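To make the attention argument concrete, here is a minimal sketch using a toy single-head dot-product attention in plain Python. It is not the model's actual attention code; the token vectors, the helper names `softmax` and `attend`, and the two-dimensional features are all made up for illustration. It shows two things: exact copies of a token behave like a single token when they are the only keys, but when other tokens (for example, input views) are also present, duplicating the target token shifts total attention weight toward it.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    # Toy single-head scaled dot-product attention for one query vector.
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

# Hypothetical tokens: one "input view" token and one "target view" token.
query = [0.3, -1.2]
input_key, input_val = [1.0, 0.5], [1.0, 0.0]
target_key, target_val = [-0.4, 0.9], [0.0, 1.0]

# Case A: the key set contains only exact copies of the target token.
one_copy = attend(query, [target_key], [target_val])
eight_copies = attend(query, [target_key] * 8, [target_val] * 8)

# Case B: one input token plus 1 or 8 exact copies of the target token.
mixed_1 = attend(query, [input_key, target_key], [input_val, target_val])
mixed_8 = attend(query, [input_key] + [target_key] * 8,
                 [input_val] + [target_val] * 8)

print("only target copies:", one_copy, eight_copies)
print("with an input token:", mixed_1, mixed_8)
```

Because `target_val` is `[0.0, 1.0]`, the second output component in Case B equals the total attention weight on the target copies, which grows with the number of copies. So even exactly identical duplicates are not a strict no-op once other tokens share the softmax, and in practice the duplicates also differ by their noise.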
I experimented by repeating target views 8 times to yield 16 outputs, but the results did not show significant improvement.
Hi @Lizb6626 Thanks for your feedback. Your argument makes sense to me, but each novel view does not start from the same random noise. Could it help the generation that the different noise samples at the same novel view have to align with each other during attention?
The results you are showing here are 1-to-16 (2 novel poses), right? Could you try different numbers of input views, e.g., 2-to-2 vs. 2-to-16 (2 novel poses), or 3 or 5 input views? The model generally works much better when input views >= 2. Please let me know whether, with a fixed number of input views, more repeated output views improve the generation.
I reduced the output views to 2 in the demo case and observed a significant performance drop. I wonder why performance excels with 25 views but suffers with 2. Could you provide insights or an explanation for this behavior? (I use the 4DoF model.)