Open ziye3001 opened 10 months ago
In LoRA training, we only drop the camera condition (which is fed into the network via `class_labels`):
https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/stable_diffusion_vsd_guidance.py#L570-L571
So at inference we do the same and apply CFG only on the camera condition.
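The camera-only CFG described above can be sketched as follows. This is a minimal illustration, not the repo's exact code: the function name, the dummy `unet` signature, and the use of an all-zeros tensor as the null camera embedding are assumptions for clarity. The key point is that `encoder_hidden_states` is duplicated unchanged, while only the condition passed via `class_labels` is zeroed in the unconditional branch, mirroring what was dropped during LoRA training.

```python
import torch


def camera_cfg_noise(unet, noisy_latents, t, camera_embeddings, guidance_scale):
    # Batch the conditional and unconditional branches together.
    # The hidden states are duplicated unchanged; only the camera
    # condition fed through `class_labels` differs between branches.
    latents_in = torch.cat([noisy_latents] * 2, dim=0)
    hidden_in = torch.cat([camera_embeddings] * 2, dim=0)
    class_in = torch.cat(
        [camera_embeddings, torch.zeros_like(camera_embeddings)], dim=0
    )
    noise_pred = unet(
        latents_in, t, encoder_hidden_states=hidden_in, class_labels=class_in
    ).sample
    noise_cond, noise_uncond = noise_pred.chunk(2)
    # Usual CFG combination of the two branches.
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```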
Hi, I'm trying to understand your implementation of the VSD loss and have a question. To get the noise prediction with CFG, one should compute both the conditioned and the unconditioned noise. So why do you use
`encoder_hidden_states=torch.cat([image_camera_embeddings] * 2, dim=0),`
in the following two links? Shouldn't it be something like `encoder_hidden_states=torch.cat([image_camera_embeddings, torch.zeros_like(image_camera_embeddings)], dim=0),`?
https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/stable_diffusion_vsd_guidance.py#L492C6-L492C6
https://github.com/threestudio-project/threestudio/blob/8a51c37317b6f7cd74bb3cb24c975b56d0a96703/threestudio/models/guidance/zero123_unified_guidance.py#L435C24-L435C24
Thank you very much!
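For contrast, the standard CFG pattern the question expects can be sketched as below. This is a generic illustration with hypothetical names (the function, the dummy `unet` signature, and the all-zeros null embedding are assumptions): here the conditioning passed through `encoder_hidden_states` itself is replaced by a null embedding in the unconditional branch, rather than being duplicated.

```python
import torch


def standard_cfg_noise(unet, noisy_latents, t, cond_embeddings, guidance_scale):
    # Standard classifier-free guidance: the unconditional branch
    # replaces the `encoder_hidden_states` conditioning with a null
    # embedding (all-zeros here for illustration).
    latents_in = torch.cat([noisy_latents] * 2, dim=0)
    hidden_in = torch.cat(
        [cond_embeddings, torch.zeros_like(cond_embeddings)], dim=0
    )
    noise_pred = unet(latents_in, t, encoder_hidden_states=hidden_in).sample
    noise_cond, noise_uncond = noise_pred.chunk(2)
    return noise_uncond + guidance_scale * (noise_cond - noise_uncond)
```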