What is the difference between this work and embedding?

If you test images that contain full bodies directly, you may get poorer results. The input faces in the paper come from the FFHQ dataset and are all cropped. You could preprocess your test images with FFHQ-Alignment or crop them to headshots.
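As a rough illustration of the "crop to headshots" option, here is a minimal sketch that computes a square crop box around a face. It is not FFHQ-Alignment and not part of this repo: the function name `headshot_crop_box`, the `scale` parameter, and the assumption that you already have a face bounding box from some detector are all hypothetical.

```python
def headshot_crop_box(face_box, image_size, scale=2.0):
    """Return a square (left, top, right, bottom) crop box around a face.

    face_box:   (x0, y0, x1, y1) from any face detector (assumed input).
    image_size: (width, height) of the full image.
    scale:      how much larger than the face the crop should be (assumed value).
    """
    x0, y0, x1, y1 = face_box
    width, height = image_size

    # Center of the detected face.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2

    # Square side: the enlarged face size, capped so it fits in the image.
    side = int(min(max(x1 - x0, y1 - y0) * scale, width, height))

    # Shift the square so it stays inside the image borders.
    left = int(min(max(cx - side / 2, 0), width - side))
    top = int(min(max(cy - side / 2, 0), height - side))

    return left, top, left + side, top + side


# Example: a 200x200 face near the top of a 1000x1500 full-body photo.
box = headshot_crop_box((400, 100, 600, 300), (1000, 1500))
print(box)
```

The returned tuple follows the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so the crop can then be resized to the model's input resolution.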
InstantID is powerful but lacks controllability over pose and expression. Identity embeddings in the word embedding space could offer better text controllability.
The identity embeddings learned by our framework (face encoder + AdaIN with celeb space + two-phase masked diffusion loss) are more closely aligned with the celeb name distribution (ideal identity consistency), i.e., they are more compatible with Stable Diffusion and its plug-and-play modules. Therefore, SD2.1-based video and 3D generation models can be combined with them seamlessly. In short, our learned embeddings can be used as naturally as celeb names in Stable Diffusion. See our inference code for details.
We think text embeddings work with Stable Diffusion more naturally. A LoRA might not work with plug-and-play modules.

qinghew / StableIdentity