threestudio-project / threestudio

A unified framework for 3D content generation.
Apache License 2.0

Unconditioned noise of LoRA #298

Open ziye3001 opened 10 months ago

ziye3001 commented 10 months ago

Hi, I'm trying to understand your implementation of the VSD loss and have a question. To get the noise prediction with CFG, one has to compute both the conditioned and the unconditioned noise. So why do you use encoder_hidden_states=torch.cat([image_camera_embeddings] * 2, dim=0) at the two links below? Shouldn't it be something like encoder_hidden_states=torch.cat([image_camera_embeddings, torch.zeros_like(image_camera_embeddings)], dim=0)?

https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/stable_diffusion_vsd_guidance.py#L492C6-L492C6

https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/zero123_unified_guidance.py#L435C24-L435C24

Thank you very much!

bennyguo commented 9 months ago

In LoRA training, we only drop the camera condition (which is fed into the network through class_labels):

https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/stable_diffusion_vsd_guidance.py#L570-L571

So at inference we do the same and apply CFG only on the camera condition. That is why encoder_hidden_states is identical in both halves of the batch.
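To make the reply above concrete, here is a minimal, framework-free sketch of CFG applied only to the camera condition. It is not the threestudio code: toy_unet is a hypothetical stand-in for the LoRA UNet, plain Python lists replace torch tensors, and it assumes a dropped camera condition is represented by zeros. The point is only that encoder_hidden_states is repeated unchanged for both halves of the batch, while class_labels is zeroed in the second half.

```python
def toy_unet(latents, encoder_hidden_states, class_labels):
    """Hypothetical stand-in for the LoRA UNet.

    Returns one scalar "noise prediction" per batch item, using an
    arbitrary formula so the effect of each input is easy to follow.
    """
    return [
        x + 0.1 * sum(h) + 0.5 * sum(c)
        for x, h, c in zip(latents, encoder_hidden_states, class_labels)
    ]


def cfg_on_camera_only(latent, embeddings, camera, guidance_scale):
    # Batch of two: [camera-conditioned, camera-dropped].
    latents = [latent, latent]
    # Same embeddings in both halves, mirroring torch.cat([emb] * 2, dim=0).
    encoder_hidden_states = [embeddings, embeddings]
    # Only the camera condition is dropped (assumed here to mean zeroed).
    class_labels = [camera, [0.0] * len(camera)]

    noise_cond, noise_uncond = toy_unet(
        latents, encoder_hidden_states, class_labels
    )
    # Standard classifier-free guidance combination.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)


if __name__ == "__main__":
    print(cfg_on_camera_only(1.0, [1.0, 2.0], [0.5, 0.5], guidance_scale=1.0))
```

Because the embeddings are the same in both halves, their contribution cancels in noise_cond - noise_uncond, so the guidance scale only amplifies what the camera condition changes. Zeroing encoder_hidden_states at inference would instead query the LoRA on an input combination that, per the reply above, it was never trained on.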