nv-tlabs / LION

Latent Point Diffusion Models for 3D Shape Generation

Question about SVR #37

Closed yufeng9819 closed 1 year ago

yufeng9819 commented 1 year ago

Hi, thanks for your great work! @ZENGXH

I want to know how single view reconstruction (SVR) is implemented. From your supplementary materials, I know you implement voxel-guided generation by fine-tuning the encoder of the VAE, and shape interpolation by Diffuse-Denoise. So I guess you just replace the VAE encoder with the CLIP image encoder and then train on ShapeNet to produce plausible shapes from a single-view image. Is that right?

Looking forward to your reply!

ZENGXH commented 1 year ago

Yes, voxel-guided generation is done by fine-tuning the VAE encoder, and shape interpolation by Diffuse-Denoise. The SVR implementation is a bit different: we take the encoder and decoder trained on the data as usual (without conditioning input), and when training the diffusion prior, we feed the CLIP image embedding as conditioning input: the shape-latent prior model takes the CLIP embedding through an AdaGN layer.
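
For concreteness, here is a minimal sketch of what such AdaGN (adaptive group normalization) conditioning could look like. The class name, dimensions, and layer layout are illustrative assumptions, not LION's actual code:

```python
import torch
import torch.nn as nn

class AdaGN(nn.Module):
    """Adaptive GroupNorm sketch: normalize features without learned affine
    parameters, then apply a per-channel scale and shift predicted from a
    conditioning embedding (e.g. a CLIP image embedding)."""
    def __init__(self, num_channels, cond_dim, num_groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(num_groups, num_channels, affine=False)
        # Predict scale (gamma) and shift (beta) from the conditioning vector.
        self.proj = nn.Linear(cond_dim, 2 * num_channels)

    def forward(self, h, cond):
        # h: (B, C, N) latent point features; cond: (B, cond_dim) embedding.
        gamma, beta = self.proj(cond).chunk(2, dim=-1)
        h = self.norm(h)
        return h * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)
```

The conditioning signal thus modulates the prior's intermediate activations at every normalization layer, rather than being concatenated to the input once.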

yufeng9819 commented 1 year ago

Great! Thanks for your reply. @ZENGXH

So does this mean the SVR implementation is the same as the text-guided generation implementation (as the code in demo.py shows)?

Looking forward to your reply!

ZENGXH commented 1 year ago

Yes, exactly. During training we use the CLIP image embedding, but at inference the provided CLIP embedding can be either an image embedding or a text embedding, since CLIP is trained so that text embeddings land close to the embeddings of the images they describe.
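
A short sketch of producing either kind of conditioning embedding with OpenAI's `clip` package; the file name, prompt, and the `cond` hand-off are illustrative assumptions:

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Image embedding: what the prior was conditioned on during training.
    image = preprocess(Image.open("view.png")).unsqueeze(0).to(device)
    img_emb = model.encode_image(image)

    # Text embedding: usable at inference because CLIP aligns both spaces.
    txt_emb = model.encode_text(clip.tokenize(["an armchair"]).to(device))

# Either img_emb or txt_emb can then be passed as the conditioning input
# (`cond` in the AdaGN sketch above) to the shape-latent prior.
```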

yufeng9819 commented 1 year ago

I got it!

Thanks for your kind response.