sihyun-yu / REPA

Official PyTorch Implementation of Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
https://sihyun.me/REPA
MIT License

Rationale behind using an external model for alignment and loss optimization #6

Closed MLDeS closed 3 days ago

MLDeS commented 3 days ago

Hi All,

Thanks for such interesting work! I would like to understand the reasoning behind using an external model's features as the target for representation alignment. Isn't it the case that, in this process, we are trying to make the current model's representations similar to those of the external model, i.e., distilling the external model's representations into the diffusion model?

If this is so, why not directly use the external model as the backbone for the intended tasks?

Also, are the same encoder features used as the target at all hidden stages?

Thanks!

sihyun-yu commented 3 days ago

Hi, thanks for your interest. The approach you mentioned is not straightforward because of an input mismatch: diffusion models work with noisy inputs, whereas most self-supervised learning encoders are trained on clean images. This issue is even more pronounced in the latent diffusion models we used in our experiments, which take as input a compressed latent image from a pretrained VAE encoder. Additionally, these off-the-shelf vision encoders are not designed for tasks like reconstruction or generation.
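To make the mismatch concrete, here is a toy sketch (not the repo's actual code; shapes and the noise schedule are illustrative assumptions): the encoder sees the clean image, while the diffusion transformer sees a noisy, compressed VAE latent.

```python
import torch

# Illustrative shapes (assumptions, not REPA's exact configuration).
x = torch.randn(1, 3, 256, 256)   # clean image: what DINOv2 would receive
z = torch.randn(1, 4, 32, 32)     # stand-in for the VAE latent of x

# Simplified forward-diffusion noising of the latent.
alpha_bar = torch.tensor(0.5)     # hypothetical noise-schedule value
eps = torch.randn_like(z)
z_t = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps  # DiT's noisy input
```

So even before any feature comparison, the two models consume different objects: clean pixels on one side, a noised latent on the other.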

MLDeS commented 3 days ago

Thanks! A follow-up: do you use the same clean features from DINOv2 for every hidden layer of the DiT for alignment?

sihyun-yu commented 3 days ago

We align a single hidden state of the diffusion transformer (e.g., the hidden state after layer 8) with the DINOv2 features.

MLDeS commented 3 days ago

And what you say in the paper is that you tried performing the alignment at different single layers, and that this alignment is always with the final encoder features of DINOv2, is that correct?

sihyun-yu commented 3 days ago

Yes.