Thanks to the authors for the excellent work. I have a question: in the training code, the spatial volume is obtained from images of multiple viewpoints, whereas in the inference code it is obtained from multiple noise vectors. Why does a spatial volume derived from noise work during inference?