xiaohu2015 opened this issue 10 months ago
Thank you for your work on IP-Adapter. I was curious about the following:
@whiterose199187
Understood
And I found that training an ip-adapter-face with an ID embedding from a face recognition model is very helpful: face similarity can be increased from 0.4-0.5 to 0.6-0.7 in some test cases.
Can I follow the progress of this anywhere? Is there some documentation I could read so I can train it myself?
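For context on how face-similarity numbers like the 0.4-0.7 range above are usually computed: cosine similarity between ID embeddings from a face recognition model. A minimal sketch, assuming InsightFace as the recognition backend (the thread does not name the exact model used for those numbers):

```python
# Minimal sketch: face similarity as cosine similarity of ID embeddings.
# InsightFace's FaceAnalysis is an assumed choice here, not necessarily what was used.
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the first detected face in each image."""
    emb_a = app.get(cv2.imread(path_a))[0].normed_embedding
    emb_b = app.get(cv2.imread(path_b))[0].normed_embedding
    return float(np.dot(emb_a, emb_b))  # embeddings are already L2-normalized

print(face_similarity("reference_face.jpg", "generated_face.jpg"))
```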
I really appreciate your excellent work. I'm wondering if using DINOv2 instead of CLIP would be more advantageous in certain scenarios.
My application has the following requirement: I need to faithfully reproduce all the details of an object, even when a text prompt is also provided.
I would like to use DINOv2 instead of CLIP as the image encoder for the following reasons. Recently, I read the paper 'Animate Anyone', which mentioned that the CLIP image encoder is trained to match text embeddings, so it tends to produce abstract, high-level features and neglect many small details. Additionally, the embedding obtained from the CLIP image encoder might not be large enough, potentially overlooking many details.
I am planning to implement my idea based on your ipadapter-full implementation. I will use DINOv2 as the image encoder to generate the embedding (including the CLS token and patch tokens), then use an MLP to extract features from the embedding. As for data preprocessing: simply segment the object and remove the background.
I saw in issue https://github.com/tencent-ailab/IP-Adapter/issues/100 that using DINOv2 can cause text prompts to stop working properly. I'm wondering whether my implementation above would also have this problem, or whether there are any other design flaws. Any help would be appreciated.
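For what it's worth, here is a rough sketch of the projection described in the plan above; it is not an official implementation and assumes facebook/dinov2-large (hidden size 1024) together with an SD 1.5 cross-attention dimension of 768:

```python
# Rough sketch only: MLP projection over DINOv2 tokens, in the spirit of ipadapter-full.
import torch
import torch.nn as nn

class DinoProjModel(nn.Module):
    """Maps DINOv2 tokens (CLS + patch tokens) to image-prompt tokens."""
    def __init__(self, dino_dim: int = 1024, cross_attention_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dino_dim, dino_dim),
            nn.GELU(),
            nn.Linear(dino_dim, cross_attention_dim),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, dino_tokens: torch.Tensor) -> torch.Tensor:
        # dino_tokens: (B, 1 + num_patches, dino_dim) from the encoder's last_hidden_state
        return self.norm(self.proj(dino_tokens))  # (B, 1 + num_patches, cross_attention_dim)
```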
Hi, I trained a face IP-Adapter with DINOv2; it can generate more consistent images compared to CLIP, and it also works with text prompts. Hence, I think you can try it.
Thank you, kind and lovely angel. Merry Christmas in advance.
Hello, I have two questions for you. 1) Did you use the last hidden state of DINOv2? 2) You mentioned the use of an ID embedding above. How is it added to the model? Looking forward to your reply, thank you!
- Yes.
- The ID embedding comes from a face recognition model; you can just add it the same way as the CLIP image embedding. And we may release a model soon.
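As a hedged illustration of "add it the same way as the CLIP image embedding": project the face ID embedding into a few extra image-prompt tokens for cross-attention. The dimensions and token count below are assumptions, not the released model's values.

```python
# Illustrative only: project a face ID embedding into image-prompt tokens.
import torch
import torch.nn as nn

class IDProjModel(nn.Module):
    def __init__(self, id_dim: int = 512, cross_attention_dim: int = 768, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Sequential(
            nn.Linear(id_dim, cross_attention_dim * num_tokens),
            nn.GELU(),
            nn.Linear(cross_attention_dim * num_tokens, cross_attention_dim * num_tokens),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, id_embeds: torch.Tensor) -> torch.Tensor:
        # id_embeds: (B, id_dim) from a face recognition model
        x = self.proj(id_embeds)                       # (B, num_tokens * cross_attention_dim)
        x = x.reshape(x.size(0), self.num_tokens, -1)  # (B, num_tokens, cross_attention_dim)
        return self.norm(x)  # consumed by the adapter's cross-attention, like CLIP tokens
```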
Thank you very much for your reply
Hi @xiaohu2015. I've tested all of your adapters, including the most recent versions that use face id. I get the best results by far with your "full face" adapter for SD 1.5.
Therefore, I think it would be very worthwhile to try to adapt that model for SDXL. I'm hoping that you will bump this up on your priority list. (This is my wish for the new year 😃 )
Thanks for all of your work!
Hi!
First of all, thanks for the great work on this project. I'm exploring the use of IPAdapterPlus with num_tokens=257 compared to IPAdapterFull. I understand that IPAdapterPlus utilizes the Resampler, a more complex architecture, which seems like it would handle the embeddings more effectively. However, I've come across some discussions suggesting that IPAdapterFull might yield better results in certain scenarios.
Could you provide more insights into the performance differences between these two approaches?
Thanks again!
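For anyone comparing the two, here is a simplified, illustrative contrast between the projection styles (not the repo's exact modules): a "full"-style head keeps all 257 hidden-state tokens and maps each through an MLP, while a "plus"-style head compresses them into a small set of learned queries via cross-attention. The dimensions are assumptions (CLIP ViT-H hidden size 1280, SD 1.5 cross-attention dim 768).

```python
# Illustrative contrast between "full"-style and "plus"-style projection heads.
import torch
import torch.nn as nn

class FullStyleProj(nn.Module):
    """Per-token MLP: 257 encoder tokens in, 257 image-prompt tokens out."""
    def __init__(self, embed_dim: int = 1280, cross_attention_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, cross_attention_dim),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, 257, embed_dim)
        return self.norm(self.mlp(x))                     # (B, 257, cross_attention_dim)

class PlusStyleProj(nn.Module):
    """Resampler-like: a few learned queries cross-attend to the 257 tokens."""
    def __init__(self, embed_dim: int = 1280, cross_attention_dim: int = 768, num_queries: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, cross_attention_dim))
        self.proj_in = nn.Linear(embed_dim, cross_attention_dim)
        self.attn = nn.MultiheadAttention(cross_attention_dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, 257, embed_dim)
        kv = self.proj_in(x)
        q = self.queries.expand(x.size(0), -1, -1)
        out, _ = self.attn(q, kv, kv)
        return self.norm(out)                             # (B, num_queries, cross_attention_dim)
```

The resampler trades per-token detail for a compact, fixed number of prompt tokens, which may be one reason the "full" variant is reported to preserve more fine detail in some scenarios.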
@xiaohu2015 Hi, thank you very much for sharing your great work! I want to use DINOv2-L as the image encoder to train IP-Adapter (I tried ImageProjModel/Resampler as the projection model), and I have seen the issues about this. However, I can't get good results.
So could you consider releasing the pretrained model that uses DINOv2 as the image encoder for IP-Adapter? Or sharing the config or training logs?
@shadowkkk you can just use the DINOv2 model https://huggingface.co/facebook/dinov2-large to replace the CLIP model for ip-adapter-face-full
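A minimal sketch of that encoder swap (assumptions, not the repo's code), using transformers to compute DINOv2-large image embeddings in place of the CLIP hidden states:

```python
# Sketch: compute image embeddings with DINOv2-large instead of the CLIP image encoder.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("facebook/dinov2-large")
encoder = AutoModel.from_pretrained("facebook/dinov2-large")

image = Image.open("face.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state  # (1, 1 + num_patches, 1024)

cls_token = hidden[:, :1]     # global summary token
patch_tokens = hidden[:, 1:]  # local detail tokens
# Feed these to the adapter's projection model in place of the CLIP hidden states;
# the projection's input dimension must change to DINOv2-large's 1024.
```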
Thanks for your reply!
I already did! I use the pretrained DINOv2-L model as the image encoder, the last hidden state (only the patch tokens) of DINOv2 as the image embedding, LAION-Aesthetics 6+ (laionA6plus) as the training dataset, and ImageProjModel/Resampler as the projection model. The drop rates for the image embedding and text follow your script tutorial_train_plus.py. After 300k steps, I still couldn't get results like your CLIP IP-Adapter; they look very bad. I have trained multiple times but still haven't achieved good results. I wonder if there are flaws in my setup.
Could you give me some advice? Any help would be appreciated.
I only used the DINO model to train the IP-Adapter-face model. Did you train an IP-Adapter-Plus model?
Yep! I also trained the plain IP-Adapter model. QAQ
Maybe ... can I have your WeChat? (pretty please)
xiaoxiaohu1994
In general, when training an IP-Adapter along with a ControlNet, and near-exact reproduction of fine details is needed, should the IP-Adapter do most of the heavy lifting or the ControlNet? Specifically, I'm curious about training cases where the image prompt only contains part of the full image (for example, image prompt = subject, full image = subject + background). If the IP-Adapter only has access to some (but not all) of the full image's features, should its learning rate be lower and the ControlNet's higher, so that most of the learning comes from the ControlNet, which may have more capacity to learn more of the image? Or is there a smarter way to achieve this than learning-rate tuning?
I think it needs some experiments.
Thanks. I'll try various hyperparameters.
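Regarding the learning-rate question above, a minimal sketch of one way to bias the split is to use optimizer parameter groups. The module names below are stand-ins (illustrative only), not the training script's:

```python
# Sketch: different learning rates for the IP-Adapter modules and the ControlNet.
import itertools
import torch
import torch.nn as nn

# Placeholder modules standing in for the real ones (illustrative only).
image_proj_model = nn.Linear(1024, 768)
adapter_modules = nn.ModuleList([nn.Linear(768, 768)])
controlnet = nn.Sequential(nn.Conv2d(3, 320, 3, padding=1))

optimizer = torch.optim.AdamW(
    [
        {
            "params": itertools.chain(image_proj_model.parameters(),
                                      adapter_modules.parameters()),
            "lr": 1e-5,  # smaller LR: the image prompt only covers part of the target
        },
        {
            "params": controlnet.parameters(),
            "lr": 1e-4,  # larger LR: let the ControlNet carry more of the fine detail
        },
    ],
    weight_decay=1e-2,
)
```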
Have you figured out why you couldn't get good results after switching to DINOv2? I've encountered the same issue.
The DINOv2 embedding shape is (1536, 2048); how do you convert it to (512,)?
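One possible reduction (an assumption, not an official recipe): pool over the token dimension of the DINOv2 output, then learn a linear projection down to 512. The hidden size depends on the DINOv2 variant (e.g. 1536 for dinov2-giant).

```python
# Sketch: reduce token-level DINOv2 features to a single fixed-size vector.
import torch
import torch.nn as nn

class PooledProj(nn.Module):
    def __init__(self, hidden_dim: int = 1536, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, out_dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, num_tokens, hidden_dim); alternatively use only the CLS token
        pooled = tokens.mean(dim=1)  # (B, hidden_dim)
        return self.proj(pooled)     # (B, out_dim), e.g. (B, 512)
```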
Hello, I would like to ask how your training progress is going?
Thanks again for your great work. I can't find any released model with DINOv2. Is there any plan to release pretrained weights for an IP-Adapter face model with DINOv2? And a comparison between DINOv2 and InsightFace FaceID?
Thank you for your interest in IP-Adapter. I have opened this issue for discussion; anyone is welcome to share their opinions to help further improve IP-Adapter.