qinghew / StableIdentity

🔥 StableIdentity: Inserting Anybody into Anywhere at First Sight
https://qinghew.github.io/StableIdentity/
MIT License
246 stars, 8 forks

What is the difference between this work and embedding? #8

Open zengjie617789 opened 1 month ago

zengjie617789 commented 1 month ago

First of all, thanks for sharing this work. I tested this code with a reference code, but the results were not what I expected. In terms of similarity, it is far from InstantID's performance. Furthermore, I am curious what the innovation of this work is, and why not use LoRA training directly, which has turned out to be much better than embedding training?

qinghew commented 1 month ago
  1. If you tested directly on images that include the body, the results may be poorer. The input faces in the paper are from the FFHQ dataset and are all cropped. You could preprocess your test images with FFHQ-Alignment or crop them to headshots.
  2. InstantID is powerful but lacks controllability over pose and expression. Identity embeddings in the word embedding space can offer better text controllability.
  3. The identity embeddings learned by our framework (face encoder + AdaIN with celeb space + two-phase masked diffusion loss) are more aligned with the celeb name distribution (ideal identity consistency), i.e., more compatible with Stable Diffusion and its plug-and-play modules. Therefore, SD2.1-based video and 3D generation models can be seamlessly combined. In short, our learned embeddings can be used as naturally as celeb names in Stable Diffusion. See our inference code.
  4. We think text embeddings can work with Stable Diffusion more naturally. A LoRA might not work with plug-and-play modules.
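
To illustrate the "AdaIN with celeb space" step from point 3, here is a minimal sketch. It is not the repository's implementation. It assumes the standard AdaIN formula: the face encoder's output is normalized by its own mean and standard deviation, then rescaled with per-dimension statistics computed over celeb name embeddings. The function names `column_stats` and `adain` and the toy vectors are hypothetical, and plain Python lists stand in for the tensors a real pipeline would use.

```python
from statistics import fmean, pstdev


def column_stats(celeb_embeddings):
    """Per-dimension mean and std over equally sized celeb name embeddings."""
    columns = list(zip(*celeb_embeddings))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]


def adain(identity_embedding, celeb_mean, celeb_std, eps=1e-8):
    """Shift an identity embedding toward the celeb embedding statistics.

    Assumption: the identity embedding is normalized with its own scalar
    mean/std, then scaled and shifted per dimension by the celeb statistics.
    """
    mu = fmean(identity_embedding)
    sigma = pstdev(identity_embedding)
    return [
        s * (x - mu) / (sigma + eps) + m
        for x, m, s in zip(identity_embedding, celeb_mean, celeb_std)
    ]


# Toy data: two 3-dimensional "celeb name" embeddings (hypothetical values).
celeb_embeddings = [[1.0, 2.0, 3.0], [3.0, 2.0, 5.0]]
celeb_mean, celeb_std = column_stats(celeb_embeddings)

# Hypothetical raw output of a face encoder, on a very different scale.
raw_identity = [10.0, 20.0, 30.0]
aligned_identity = adain(raw_identity, celeb_mean, celeb_std)
print(aligned_identity)
```

After this step the embedding has the same per-dimension scale and offset as the celeb name embeddings. This is the intuition behind the claim that the learned embeddings can be used like celeb names in a prompt.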