Closed: MLDeS closed this issue 3 days ago
Hi, thanks for your interest. The approach you mention is not straightforward because of an input mismatch: diffusion models operate on noisy inputs, whereas most self-supervised learning encoders are trained on clean images. The issue is even more pronounced in the latent diffusion models we used in our experiments, which take as input a compressed latent produced by a pretrained VAE encoder rather than raw pixels. Additionally, these off-the-shelf vision encoders are not designed for tasks like reconstruction or generation.
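To make the mismatch concrete, here is a minimal shape-level sketch. The array shapes (a 4-channel latent at 1/8 resolution, as in Stable-Diffusion-style LDMs) and the toy noising schedule are illustrative assumptions, not the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
B, H, W = 2, 256, 256

# What an SSL encoder like DINOv2 is trained on: clean RGB images.
x_clean = rng.standard_normal((B, 3, H, W))

# A VAE encoder compresses the image into a latent (stand-in values here).
z = rng.standard_normal((B, 4, H // 8, W // 8))

# The diffusion model never sees z itself, only a noised version z_t.
t = 0.7                                    # toy noise level in (0, 1)
noise = rng.standard_normal(z.shape)
z_t = np.sqrt(1.0 - t) * z + np.sqrt(t) * noise

# Mismatch: z_t has different shape, channels, and statistics from x_clean,
# so a clean-image encoder cannot simply be dropped in as the backbone.
print(x_clean.shape, z_t.shape)
```

The clean-image encoder is still usable as an alignment *target* (computed on `x_clean`), which is exactly what the paper does.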
Thanks! A follow-up: do you use the same clean DINOv2 features as the alignment target for every hidden layer of the DiT?
We align a single hidden state of the diffusion transformer (e.g., the hidden state after layer 8) with the DINOv2 features.
And, as stated in the paper, you tried performing the alignment at different single layers, and the alignment target is always the final encoder features of DINOv2. Is that correct?
Yes.
Hi All,
Thanks for such interesting work! I would like to understand the reasoning behind using an external model's features as the target for representation alignment. Isn't this process effectively making the current model's representations similar to the external model's, i.e., distilling the external model's representations into the diffusion model?
If this is so, why not directly use the external model as the backbone for the intended tasks?
Also, are the same encoder features used as the target at all hidden stages?
Thanks!