roykapon / MAS

The official implementation of the paper "MAS: Multiview Ancestral Sampling for 3D Motion Generation Using 2D Diffusion"
MIT License

How Do Diffusion Models Maintain Consistency Across Views After Denoising Without Specific Conditions? #8

Closed: 2019211753 closed this issue 3 months ago

2019211753 commented 5 months ago

I understand that the input noise across the different views is consistent because it originates from projecting a single 3D noise sample. However, I'm curious how this consistency is maintained after the denoising step: how does each view continue to represent the same motion from its own perspective? I assumed that training such a diffusion model would require conditioning on the viewing angle, but according to the paper it is done unconditionally. Thank you for your clarification.
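(For readers following along, the shared-noise idea mentioned in the question can be sketched as follows. This is a minimal illustration, not the MAS codebase: the function name, orthographic cameras, and joint count are all assumptions.)

```python
import numpy as np

def project_noise(noise_3d, cameras):
    """Project one shared 3D noise sample into every view.

    noise_3d: (J, 3) Gaussian noise, one row per joint.
    cameras:  list of (2, 3) orthographic projection matrices (assumed here
              for simplicity; the paper's cameras may differ).
    Returns a list of (J, 2) per-view noise tensors that are correlated
    because they all come from the same underlying 3D sample.
    """
    return [noise_3d @ P.T for P in cameras]

rng = np.random.default_rng(0)
noise_3d = rng.standard_normal((22, 3))  # one shared 3D noise sample (22 joints assumed)
cameras = [
    np.eye(2, 3),                              # toy front view: keeps (x, y)
    np.array([[0., 0., 1.], [0., 1., 0.]]),    # toy side view: keeps (z, y)
]
per_view = project_noise(noise_3d, cameras)
```

Because both toy cameras share the vertical axis, the y-component of the noise is identical across the two views, which is exactly the kind of cross-view correlation the question refers to.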

roykapon commented 4 months ago

Hi there @2019211753! This is exactly the beauty of the method: it does not require viewing-angle conditioning. What keeps all views consistent with each other is our consistency block. At each denoising stage it takes the motion predictions from all views (which are not necessarily multiview-consistent), triangulates them into a single 3D motion, and then projects that motion back to all views, replacing the original predictions with the projections. These projections are what feed the diffusion process in each view, so the entire process remains multiview-consistent. The 3D noise provides a crucial boost to the coordination between the views, but it is the consistency block that keeps them together.
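(Editor's note: the triangulate-then-reproject step described above can be sketched roughly as below. This is a hedged illustration, not the actual MAS implementation: the function name, orthographic (2, 3) cameras, and least-squares triangulation are all simplifying assumptions.)

```python
import numpy as np

def consistency_block(preds, cameras):
    """Triangulate per-view 2D predictions into one 3D motion, then reproject.

    preds:   list of (J, 2) per-view 2D joint predictions.
    cameras: list of (2, 3) orthographic projection matrices (assumed here;
             the paper's camera model may differ).
    Returns the (J, 3) triangulated joints and the list of (J, 2)
    reprojections that replace the original per-view predictions.
    """
    A = np.vstack(cameras)                            # (2V, 3) stacked projections
    B = np.concatenate([p.T for p in preds], axis=0)  # (2V, J) stacked observations
    X, *_ = np.linalg.lstsq(A, B, rcond=None)         # (3, J) least-squares 3D fit
    points = X.T                                      # (J, 3) triangulated joints
    reprojected = [points @ P.T for P in cameras]     # consistent per-view preds
    return points, reprojected
```

If the per-view predictions already agree with some underlying 3D motion, the block recovers it exactly; when they disagree, the least-squares fit snaps all views onto the closest single 3D motion, which is what keeps the denoising trajectories coordinated.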