tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Discussion #175 (Open)

xiaohu2015 opened this discussion 10 months ago

xiaohu2015 commented 10 months ago

Thank you for your interest in IP-Adapter. I have opened this issue as a place for discussion: anyone can share opinions and suggestions to help improve IP-Adapter further.

whiterose199187 commented 10 months ago

Thank you for your work on IP-Adapter. I was curious about the following:

  1. Is ip-adapter-full-face_sd15.bin the recommended way to use face images as input prompts? Are there plans to release a similar version for SDXL?
  2. Are there plans to support SDXL Turbo?
xiaohu2015 commented 10 months ago

@whiterose199187

  1. Yes, you should use cropped face images to get good results. Although the full-face version of the model improves things to some extent, it still has major flaws, so I will optimize the SDXL version after I have a better solution. I also found that training an ip-adapter-face with ID embeddings from a face recognition model is very helpful: face similarity can be increased from 0.4-0.5 to 0.6-0.7 in some test cases (a rough sketch of the measurement I mean follows after this list).
  2. This is under consideration but has not been implemented yet (it seems the training code for SDXL Turbo has not been released).
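
For context, face similarity here is measured as something like the cosine similarity between ID embeddings from a face recognition model. A minimal sketch of that measurement (an illustration only, not evaluation code from this repo; it assumes the insightface package and uses hypothetical file paths):

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

# Illustration only: face similarity as cosine similarity of ID embeddings.
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

def id_embedding(path: str) -> np.ndarray:
    # Detect the face and return its L2-normalized 512-d ID embedding.
    faces = app.get(cv2.imread(path))
    return faces[0].normed_embedding

ref = id_embedding("reference_face.jpg")   # hypothetical reference image
gen = id_embedding("generated_face.jpg")   # hypothetical generated image
print("face similarity:", float(np.dot(ref, gen)))  # e.g. in the ~0.4-0.7 range
```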
whiterose199187 commented 10 months ago

Understood.

> I also found that training an ip-adapter-face with ID embeddings from a face recognition model is very helpful: face similarity can be increased from 0.4-0.5 to 0.6-0.7 in some test cases.

Can I follow the progress of this anywhere? Is there any documentation I could read so that I can train it myself?

chuck-ma commented 10 months ago

I really appreciate your excellent work. I'm wondering if using DINOv2 instead of CLIP would be more advantageous in certain scenarios.

I encountered this requirement in my application scenario: I need to be able to realistically restore all the details of an object, even with a text prompt.

I would like to use DINOv2 instead of CLIP as the image encoder for the following reasons: Recently, I read the paper 'Animate Anyone,' where it was mentioned that the training objective of the CLIP image encoder is to match with text embeddings. As a result, it may produce abstract and macro features, neglecting many small details. Additionally, the embedding obtained from the CLIP image encoder might not be large enough, potentially overlooking many details.

I am planning to implement my idea based on your ip-adapter-full implementation. I will use DINOv2 as the image encoder to generate the embedding (including the cls token and the patch tokens). Then, I will use an MLP to extract features from the embedding. As for data preprocessing: simply segment the object and remove the background.

I saw in issue #100 (https://github.com/tencent-ailab/IP-Adapter/issues/100) that using DINOv2 can cause text prompts to stop working properly. I'm wondering whether my implementation above would have the same problem, or whether there are any other design flaws. Any help would be appreciated.
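
To make this concrete, here is a rough sketch of the plan (an illustration under my own assumptions; the model choice, dimensions, and the MLP head are placeholders, not working IP-Adapter code):

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoImageProcessor, Dinov2Model

# DINOv2 as the image encoder instead of CLIP.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
encoder = Dinov2Model.from_pretrained("facebook/dinov2-large")

# Hypothetical input: the object already segmented, background removed.
image = Image.open("object_no_background.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state = cls token + patch tokens; hidden size 1024 for dinov2-large
    hidden = encoder(**inputs).last_hidden_state  # (1, 1 + num_patches, 1024)

# Simple MLP head in place of the CLIP projection; the output dim here matches
# the SD 1.5 cross-attention dim (768) purely as an example.
proj = nn.Sequential(
    nn.Linear(1024, 768),
    nn.GELU(),
    nn.Linear(768, 768),
    nn.LayerNorm(768),
)
image_tokens = proj(hidden)  # (1, 1 + num_patches, 768), used as extra context tokens
print(image_tokens.shape)
```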

xiaohu2015 commented 10 months ago

> I really appreciate your excellent work. I'm wondering if using DINOv2 instead of CLIP would be more advantageous in certain scenarios.
>
> I encountered this requirement in my application scenario: I need to be able to realistically restore all the details of an object, even with a text prompt.
>
> I would like to use DINOv2 instead of CLIP as the image encoder for the following reasons: Recently, I read the paper 'Animate Anyone,' where it was mentioned that the training objective of the CLIP image encoder is to match with text embeddings. As a result, it may produce abstract and macro features, neglecting many small details. Additionally, the embedding obtained from the CLIP image encoder might not be large enough, potentially overlooking many details.
>
> I am planning to implement my idea based on your ip-adapter-full implementation. I will use DINOv2 as the image encoder to generate the embedding (including the cls token and the patch tokens). Then, I will use an MLP to extract features from the embedding. As for data preprocessing: simply segment the object and remove the background.
>
> I saw in issue #100 that using DINOv2 can cause text prompts to stop working properly. I'm wondering whether my implementation above would have the same problem, or whether there are any other design flaws. Any help would be appreciated.

Hi, I trained a face ip-adapter with DINOv2; it can generate more consistent images than CLIP and it also works with text prompts. Hence, I think you can try it.

chuck-ma commented 10 months ago

> > I really appreciate your excellent work. I'm wondering if using DINOv2 instead of CLIP would be more advantageous in certain scenarios. I encountered this requirement in my application scenario: I need to be able to realistically restore all the details of an object, even with a text prompt. I would like to use DINOv2 instead of CLIP as the image encoder for the following reasons: Recently, I read the paper 'Animate Anyone,' where it was mentioned that the training objective of the CLIP image encoder is to match with text embeddings. As a result, it may produce abstract and macro features, neglecting many small details. Additionally, the embedding obtained from the CLIP image encoder might not be large enough, potentially overlooking many details. I am planning to implement my idea based on your ip-adapter-full implementation. I will use DINOv2 as the image encoder to generate the embedding (including the cls token and the patch tokens). Then, I will use an MLP to extract features from the embedding. As for data preprocessing: simply segment the object and remove the background. I saw in issue #100 that using DINOv2 can cause text prompts to stop working properly. I'm wondering whether my implementation above would have the same problem, or whether there are any other design flaws. Any help would be appreciated.
>
> Hi, I trained a face ip-adapter with DINOv2; it can generate more consistent images than CLIP and it also works with text prompts. Hence, I think you can try it.

Thank you, kind and lovely angel. Merry Christmas in advance.

lxd941213 commented 10 months ago

> > I really appreciate your excellent work. I'm wondering if using DINOv2 instead of CLIP would be more advantageous in certain scenarios. I encountered this requirement in my application scenario: I need to be able to realistically restore all the details of an object, even with a text prompt. I would like to use DINOv2 instead of CLIP as the image encoder for the following reasons: Recently, I read the paper 'Animate Anyone,' where it was mentioned that the training objective of the CLIP image encoder is to match with text embeddings. As a result, it may produce abstract and macro features, neglecting many small details. Additionally, the embedding obtained from the CLIP image encoder might not be large enough, potentially overlooking many details. I am planning to implement my idea based on your ip-adapter-full implementation. I will use DINOv2 as the image encoder to generate the embedding (including the cls token and the patch tokens). Then, I will use an MLP to extract features from the embedding. As for data preprocessing: simply segment the object and remove the background. I saw in issue #100 that using DINOv2 can cause text prompts to stop working properly. I'm wondering whether my implementation above would have the same problem, or whether there are any other design flaws. Any help would be appreciated.
>
> Hi, I trained a face ip-adapter with DINOv2; it can generate more consistent images than CLIP and it also works with text prompts. Hence, I think you can try it.

Hello, I have two questions for you: 1) Did you use the last hidden state of DINOv2? 2) You mentioned the use of ID embeddings above; how are these added to the model? Looking forward to your reply, thank you!

xiaohu2015 commented 10 months ago
  1. Yes.
  2. The ID embedding comes from a face recognition model; you can just add it in the same way as the CLIP image embedding (a rough sketch follows below). And we may release a model soon.
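
To illustrate what "the same way as the CLIP image embedding" means (a minimal sketch with assumed dimensions and a hypothetical class name, not the released code):

```python
import torch
import torch.nn as nn

class IDProjModel(nn.Module):
    """Hypothetical projection: map a 512-d face ID embedding into a few extra
    context tokens, mirroring how the global CLIP image embedding is projected."""

    def __init__(self, id_embed_dim=512, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Linear(id_embed_dim, num_tokens * cross_attention_dim)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, id_embeds):
        # id_embeds: (batch, 512) from a face recognition model such as ArcFace
        tokens = self.proj(id_embeds).reshape(-1, self.num_tokens, self.cross_attention_dim)
        return self.norm(tokens)  # fed to cross-attention alongside the text tokens

print(IDProjModel()(torch.randn(1, 512)).shape)  # torch.Size([1, 4, 768])
```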
lxd941213 commented 10 months ago
>   1. Yes.
>   2. The ID embedding comes from a face recognition model; you can just add it in the same way as the CLIP image embedding. And we may release a model soon.

Thank you very much for your reply

blistick commented 9 months ago

Hi @xiaohu2015. I've tested all of your adapters, including the most recent versions that use face id. I get the best results by far with your "full face" adapter for SD 1.5.

Therefore, I think it would be very worthwhile to try to adapt that model for SDXL. I'm hoping that you will bump this up on your priority list. (This is my wish for the new year 😃 )

Thanks for all of your work!

juancopi81 commented 9 months ago

Hi!

First of all, thanks for the great work on this project. I'm exploring the use of IPAdapterPlus with num_tokens=257 compared to IPAdapterFull. I understand that IPAdapterPlus utilizes the Resampler with a more complex architecture, which seems like it would handle the embeddings more effectively. However, I've come across some discussions suggesting that IPAdapterFull might yield better results in certain scenarios.

Could you provide more insights into the performance differences between these two approaches?

Thanks again!

shadowkkk commented 8 months ago
>   1. Yes.
>   2. The ID embedding comes from a face recognition model; you can just add it in the same way as the CLIP image embedding. And we may release a model soon.

@xiaohu2015 Hi, thank you very much for sharing your great work! I want to use DINOv2-L as the image encoder to train the IP-Adapter (I tried ImageProjModel/Resampler as the projection model), and I have seen the related issues. However, I can't get good results.

So could you consider releasing the pretrained model that uses DINOv2 as the image encoder for IP-Adapter, or releasing the config or training logs?

xiaohu2015 commented 8 months ago

@shadowkkk You can just use the DINOv2 model https://huggingface.co/facebook/dinov2-large to replace the CLIP model for ip-adapter-face-full.

shadowkkk commented 8 months ago

> @shadowkkk You can just use the DINOv2 model https://huggingface.co/facebook/dinov2-large to replace the CLIP model for ip-adapter-face-full.

Thanks for your reply!

I already did! I use the pretrained DINOv2-L model as the image encoder, the last hidden state (only the patch tokens) of DINOv2 as the image embedding, LAION-Aesthetics 6plus as the training dataset, and ImageProjModel/Resampler as the projection model. The drop rates for the image embedding and the text follow your script tutorial_train_plus.py. After 300k steps I still couldn't get results like your CLIP-based IP-Adapter; the outputs look very bad. I have trained multiple times but still haven't achieved good results. I wonder if there are flaws in my setup.

Could you give me some advice? Any help would be appreciated.

xiaohu2015 commented 8 months ago

I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?

shadowkkk commented 8 months ago

> I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?

Yep! I also trained an IP-Adapter model. QAQ

Maybe ... can I have your WeChat? (Pretty please!)

xiaohu2015 commented 8 months ago

> > I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?
>
> Yep! I also trained an IP-Adapter model. QAQ
>
> Maybe ... can I have your WeChat? (Pretty please!)

xiaoxiaohu1994

Yunski commented 6 months ago

In general, when training an IP-Adapter along with a ControlNet, and near-exact reproduction of fine details is needed, should the IP-Adapter do most of the heavy lifting, or the ControlNet? Specifically, I'm curious about training cases where the image prompt only contains part of the full image (for example, image prompt = subject, full image = subject + background). If the IP-Adapter only has access to some (but not all) of the full image's features, should the learning rate be lower for the IP-Adapter and higher for the ControlNet, so that most of the learning comes from the ControlNet, which may have more capacity to learn more of the image? Or is there a smarter way to achieve this than learning-rate tuning?

xiaohu2015 commented 6 months ago

> In general, when training an IP-Adapter along with a ControlNet, and near-exact reproduction of fine details is needed, should the IP-Adapter do most of the heavy lifting, or the ControlNet? Specifically, I'm curious about training cases where the image prompt only contains part of the full image (for example, image prompt = subject, full image = subject + background). If the IP-Adapter only has access to some (but not all) of the full image's features, should the learning rate be lower for the IP-Adapter and higher for the ControlNet, so that most of the learning comes from the ControlNet, which may have more capacity to learn more of the image? Or is there a smarter way to achieve this than learning-rate tuning?

I think it needs some experiments.

Yunski commented 6 months ago

> > In general, when training an IP-Adapter along with a ControlNet, and near-exact reproduction of fine details is needed, should the IP-Adapter do most of the heavy lifting, or the ControlNet? Specifically, I'm curious about training cases where the image prompt only contains part of the full image (for example, image prompt = subject, full image = subject + background). If the IP-Adapter only has access to some (but not all) of the full image's features, should the learning rate be lower for the IP-Adapter and higher for the ControlNet, so that most of the learning comes from the ControlNet, which may have more capacity to learn more of the image? Or is there a smarter way to achieve this than learning-rate tuning?
>
> I think it needs some experiments.

Thanks. I'll try various hyperparameters.
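
For example, one thing I plan to try is giving the two branches separate optimizer parameter groups (just a sketch with stand-in modules, not anything from this repo):

```python
import torch
import torch.nn as nn

# Stand-ins for the real modules; in an actual training script these would be
# the IP-Adapter projection + new cross-attention weights and a diffusers ControlNetModel.
ip_adapter_modules = nn.Linear(512, 768)
controlnet = nn.Conv2d(3, 320, kernel_size=3, padding=1)

optimizer = torch.optim.AdamW(
    [
        # Lower LR: the adapter only sees the subject crop, so let it adapt gently.
        {"params": ip_adapter_modules.parameters(), "lr": 1e-5},
        # Higher LR: the controlnet sees the full image and does most of the learning.
        {"params": controlnet.parameters(), "lr": 1e-4},
    ],
    weight_decay=1e-2,
)
print([group["lr"] for group in optimizer.param_groups])
```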

karasu801 commented 5 months ago

> > @shadowkkk You can just use the DINOv2 model https://huggingface.co/facebook/dinov2-large to replace the CLIP model for ip-adapter-face-full.
>
> Thanks for your reply!
>
> I already did! I use the pretrained DINOv2-L model as the image encoder, the last hidden state (only the patch tokens) of DINOv2 as the image embedding, LAION-Aesthetics 6plus as the training dataset, and ImageProjModel/Resampler as the projection model. The drop rates for the image embedding and the text follow your script tutorial_train_plus.py. After 300k steps I still couldn't get results like your CLIP-based IP-Adapter; the outputs look very bad. I have trained multiple times but still haven't achieved good results. I wonder if there are flaws in my setup.
>
> Could you give me some advice? Any help would be appreciated.

Have you figured out why you were unable to generate good results after switching to DINOv2? I have encountered the same issue.

zengjie617789 commented 3 months ago

The DINOv2 embedding shape is (1536, 2048); how do you convert it to (512,)?

lqfool commented 3 months ago

> > I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?
>
> Yep! I also trained an IP-Adapter model. QAQ
>
> Maybe ... can I have your WeChat? (Pretty please!)

Hello, I would like to ask how your training is going?

flankechen commented 2 months ago

> I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?

Thanks for your great work again. I cannot find any released model that uses DINOv2. Is there any plan to release pretrained weights for an IP-Adapter face model with DINOv2, and a comparison between DINOv2 and the InsightFace FaceID approach?