tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images from an image prompt.
Apache License 2.0

Preserving human face likeness #5

Open blistick opened 1 year ago

blistick commented 1 year ago

First off, congratulations on this project and thank you so much for your work!

I'm testing your SDXL demo and am generally getting good results. However, my use case is really human "personalization", as with DreamBooth. I've tried your multimodal prompt with some of my images, and the likeness of the face is not quite what I'd like: the face shows variation where I wish it stayed truer to the original input.

Do you have any suggestions on settings or values that I could tweak to try and improve this? Or other ideas?

Again, thanks for everything!

haofanwang commented 1 year ago

I don't think it can. This is highly restricted by the CLIP embedding.

zhangjun001 commented 1 year ago

The point from @haofanwang is correct. The high-level CLIP image embedding may not extract enough facial detail for "personalization". Within that limitation, we can still enhance the face by increasing the proportion of the image that the face occupies. Even so, the capability is bounded by the input size (224×224) of the CLIP image encoder. As we discussed in our paper, "it can only generate images that resemble the reference images in content and style." DreamBooth and other fine-tuning techniques can encode the common attributes (e.g., the face) of the training images into the network. We'll also continue to look at how to improve person/object consistency.
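As a concrete illustration of "increasing the proportion of the face in the image", here is a minimal preprocessing sketch. The `face_box` argument and the margin factor are placeholders (assume they come from any face detector); nothing here is part of the IP-Adapter API.

```python
# Sketch: enlarge the face's share of the image prompt before CLIP encoding.
# face_box is assumed to come from any face detector; values are illustrative.
from PIL import Image

def crop_face_for_clip(image_path, face_box, margin=0.3):
    """Crop around (x1, y1, x2, y2) with extra margin, then resize to
    224x224, the input resolution of the CLIP image encoder."""
    img = Image.open(image_path).convert("RGB")
    x1, y1, x2, y2 = face_box
    w, h = x2 - x1, y2 - y1
    # Expand the box so some context around the face survives the crop.
    box = (max(0, int(x1 - margin * w)),
           max(0, int(y1 - margin * h)),
           min(img.width, int(x2 + margin * w)),
           min(img.height, int(y2 + margin * h)))
    return img.crop(box).resize((224, 224), Image.BICUBIC)

# Usage: pass the returned crop to the pipeline as the image prompt, e.g.
# face_img = crop_face_for_clip("portrait.jpg", face_box=(120, 80, 360, 400))
```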

blistick commented 1 year ago

@zhangjun001 Thank you for your reply, and again for your fine work. I will try making the subject's face as large as possible to see if I achieve a better likeness, and I also look forward to future improvements.

scarbain commented 1 year ago

> The point from @haofanwang is correct. The high-level CLIP image embedding may not extract enough facial detail for "personalization". Within that limitation, we can still enhance the face by increasing the proportion of the image that the face occupies. Even so, the capability is bounded by the input size (224×224) of the CLIP image encoder. As we discussed in our paper, "it can only generate images that resemble the reference images in content and style." DreamBooth and other fine-tuning techniques can encode the common attributes (e.g., the face) of the training images into the network. We'll also continue to look at how to improve person/object consistency.

@zhangjun001 What about using embeddings from a facial recognition model instead of CLIP embeddings? For example, this paper used embeddings from ArcFace: https://github.com/junshutang/3DFaceShop

I've tried this technique with ControlNet, removing the last two fully connected layers from VGGFace to get spatial facial feature maps. I modified the ControlNet tiny network and convolution layers to accommodate the new input shape, but without any success. The model doesn't seem to be learning anything, though that might be because the architecture I chose for the tiny network is poor (I'm a beginner at ML) and/or it hasn't trained enough. I'm currently at 25K steps, which isn't much, but in my previous ControlNet trainings I could always see some convergence well before that many steps.

Do you think it could work with your architecture?
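For reference, a minimal sketch of pulling ArcFace-style identity embeddings with the insightface package (standard usage of that library, nothing specific to this repo), which could serve as the conditioning signal discussed above:

```python
# Sketch: extract a 512-d ArcFace identity embedding with insightface,
# as a candidate replacement for (or complement to) the CLIP image embedding.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")   # bundles a face detector + ArcFace recognizer
app.prepare(ctx_id=0, det_size=(640, 640))

img = cv2.imread("portrait.jpg")       # insightface expects a BGR numpy array
faces = app.get(img)
if not faces:
    raise ValueError("no face detected")

id_embed = faces[0].normed_embedding   # shape (512,), L2-normalized identity vector
print(id_embed.shape)
```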

zhangjun001 commented 1 year ago

@scarbain I think face-specific embeddings should help with face generation. However, if the embeddings are not well aligned with the text space, it will take more resources to train (more training time) and to align them (more parameters may be needed before cross attention).
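To make the "more parameters before cross attention" point concrete, here is a hypothetical alignment head (names and sizes are my own illustration, not IP-Adapter's actual code) that projects a 512-d identity embedding into a few tokens of the text-embedding dimension before they enter cross attention:

```python
# Hypothetical sketch of an alignment head: project a 512-d identity embedding
# into N extra "tokens" of the cross-attention dimension. Sizes are illustrative.
import torch
import torch.nn as nn

class FaceProj(nn.Module):
    def __init__(self, embed_dim=512, cross_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(embed_dim, cross_dim * num_tokens)
        self.norm = nn.LayerNorm(cross_dim)

    def forward(self, face_embed):
        # (B, 512) -> (B, num_tokens, cross_dim), ready to attend alongside
        # the text tokens in the UNet's cross-attention layers.
        x = self.proj(face_embed)
        x = x.reshape(-1, self.num_tokens, x.shape[-1] // self.num_tokens)
        return self.norm(x)

tokens = FaceProj()(torch.randn(1, 512))
print(tokens.shape)  # torch.Size([1, 4, 768])
```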

jiqizaisikao commented 1 year ago

[images: WechatIMG54, WechatIMG55] This is a WeChat mini-program I built, 换影CP: from a single image it does one-click head swapping and face-shape changing. I'm hoping to find partners to research and develop it together. If you're interested, feel free to test it.

jiqizaisikao commented 1 year ago

These are results from a model I trained myself, similar to Adobe's Generative Fill. The latest WeChat mini-program, 换影CP, takes one photo and one template and does one-click head swapping and face-shape changing. However, training takes considerable compute, and I may not have enough resources to keep optimizing it. I'm hoping to find collaborators for joint research and development.

ykk648 commented 1 year ago

@jiqizaisikao What's the advantage over face swap?

jiqizaisikao commented 1 year ago

> @jiqizaisikao What's the advantage over face swap?

This is one-click head swapping and face-shape changing, not one-click swapping of facial features or the whole face; in some ways that's a big difference.

hkunzhe commented 1 year ago

> These are results from a model I trained myself, similar to Adobe's Generative Fill. The latest WeChat mini-program, 换影CP, takes one photo and one template and does one-click head swapping and face-shape changing. However, training takes considerable compute, and I may not have enough resources to keep optimizing it. I'm hoping to find collaborators for joint research and development.

Compared to 妙鸭 (Miaoya Camera), does this mean users don't need to upload multiple portrait photos? Does your "training" refer to a face LoRA?

jiqizaisikao commented 1 year ago

Only one image is needed. By "training" I mean synthesizing with a single image as the reference.

h3clikejava commented 1 year ago

Alibaba has open-sourced a free alternative.

kilimchoi commented 9 months ago

> [images: WechatIMG54, WechatIMG55] This is a WeChat mini-program I built, 换影CP: from a single image it does one-click head swapping and face-shape changing. I'm hoping to find partners to research and develop it together. If you're interested, feel free to test it.

Would you be able to share how you did it with a single reference image?