Open · ppj567 opened this issue 6 months ago
How much training data did you use?
We use a much larger training dataset than the one in the original paper.
Could a large training dataset hinder the convergence of IP-Adapter, given that it has only a few trainable parameters compared to the backbone model?
1) In fact, for images like Girl with a Pearl Earring, the model should be able to overfit, because the training dataset often contains this picture. 2) For a fair comparison, you can use the same dataset and settings to train IP-Adapter on SD 1.5 or SDXL; if that works, your implementation is probably correct.
Hi, I tried setting ip_scale=0 and feeding my backbone model the exact name of this famous painting, "Girl with a Pearl Earring", as the text prompt. Below are the results from two runs. So I think the problem may not lie here.
Or is this just the normal image-prompt adherence of an IP-Adapter trained for 130k steps?
Hi, the IP-Adapter paper gives some cases trained for about 200k steps (Ablation Study 4.4.1); you can refer to that.
> Hi, I tried setting ip_scale=0 and feeding my backbone model the exact name of this famous painting, "Girl with a Pearl Earring", as the text prompt. Below are the results from two runs. So I think the problem may not lie here.
It seems your base model performs worse?
I am wondering whether a weaker base model would lead to worse IP-Adapter image-prompt adherence.
> Hi, the IP-Adapter paper gives some cases trained for about 200k steps (Ablation Study 4.4.1); you can refer to that.
The image-prompt adherence seems acceptable at 200k steps, better than my version.
> I am wondering whether a weaker base model would lead to worse IP-Adapter image-prompt adherence.
It should, since IP-Adapter is only a lightweight adapter.
You mean the base model's text adherence would affect image adherence, regardless of further training of the adapter?
What I mean is that the capability of the base model is a limiting factor. The clearest example: if you apply IP-Adapter to a better model (such as a fine-tuned model from civitai), the generation quality improves to a certain extent.
What I currently care about most is the similarity between the generated image and the input image prompt. So far, my results are not satisfying in this respect. Maybe I should move to a better base model to achieve this, as suggested.
@ppj567 Can you send me a small training dataset? I don't know the exact structure of the training data. Thanks!
A JSON file containing many text-image pairs, e.g.:

```json
[{"image_file": "1.png", "text": "A dog"}]
```
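For illustration, such a file can be read with a minimal loader like this (a sketch only; the file path and field handling around it are up to your own pipeline):

```python
import json

# Parse the training list shown above (inlined here for illustration;
# in practice you would read it from a file with json.load).
raw = '[{"image_file": "1.png", "text": "A dog"}]'
data = json.loads(raw)

for item in data:
    image_path = item["image_file"]  # path to the training image
    caption = item["text"]           # its paired text prompt
```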
@xiaohu2015 Dear xiaohu2015, could you upload a small training dataset, so I can train on my data from scratch? Thanks!
@xiaohu2015 I want to train IP-Adapter on the FFHQ dataset. After I detect and crop faces from the original images, how do I generate faceid.bin? Thanks!
So the dataset structure for training an IP-Adapter is simply text-caption and image pairs?
yes
Hi, thank you for your great work!
I tried to train an IP-Adapter on top of my own Stable-Diffusion-like backbone model (for my backbone: I slightly expanded the model size of SDXL and then pretrained it well, so it can synthesize high-quality images). My settings: batchsize=250, img_size=1024x1024, lr=2e-5 (I found lr=1e-4 leads to obvious artifacts in my case). The image projection module was modified to a two-layer MLP without LayerNorm, which still embeds the CLIP global image embedding into 4 tokens (I found that adding one more MLP layer and removing the LayerNorm improved image quality and sped up convergence in my case). My codebase is not built on diffusers, so I re-implemented IP-Adapter by adding W'_k and W'_v to the standard scaled-dot-product module in cross-attention and combining the newly generated hidden states X' with the original ones X at scale=1.0 before the linear projection layer. In total, the IP-Adapter in my case has 61 million trainable parameters.
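For reference, the decoupled cross-attention described above can be sketched as follows. This is a minimal single-head NumPy sketch; all shapes, names, and random weights are illustrative, not the actual implementation (the real model uses multi-head attention and learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard single-head scaled dot-product attention
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

# Hypothetical tiny dimensions for illustration only.
d_model = 8
rng = np.random.default_rng(0)
W_q = rng.normal(size=(d_model, d_model))     # frozen backbone query projection
W_k = rng.normal(size=(d_model, d_model))     # frozen backbone key projection (text)
W_v = rng.normal(size=(d_model, d_model))     # frozen backbone value projection (text)
W_k_ip = rng.normal(size=(d_model, d_model))  # new trainable W'_k for image tokens
W_v_ip = rng.normal(size=(d_model, d_model))  # new trainable W'_v for image tokens

def decoupled_cross_attn(x, text_tokens, image_tokens, ip_scale=1.0):
    """x: (n, d) latent queries; text_tokens / image_tokens: (m, d)."""
    q = x @ W_q
    # original text branch (frozen weights)
    h_text = attention(q, text_tokens @ W_k, text_tokens @ W_v)
    # new image branch with its own key/value projections
    h_img = attention(q, image_tokens @ W_k_ip, image_tokens @ W_v_ip)
    # combine before the output projection; ip_scale=0 disables the image branch
    return h_text + ip_scale * h_img

x = rng.normal(size=(4, d_model))
text = rng.normal(size=(77, d_model))
img = rng.normal(size=(4, d_model))
out = decoupled_cross_attn(x, text, img, ip_scale=1.0)
out_text_only = decoupled_cross_attn(x, text, img, ip_scale=0.0)
```

With ip_scale=0.0 the result reduces exactly to the frozen text branch, which is why the ip_scale=0 experiment above isolates the base model's behaviour.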
During training, I found that the convergence of IP-Adapter was really slow, even slower than training my whole backbone model from scratch. Moreover, synthesizing a highly similar variant of the input image remained a struggle; more training steps (e.g., 10k additional steps) changed little in my tests.
After training for about 130k steps, I found the results were not as good as the official IP-Adapter's. I know the official version was trained for 1000k steps, but I also noticed you mentioned in other issues that training for about 2-3 days (i.e., about 200k-300k steps) can achieve good results. Below are my re-implemented results (trained on image-text pairs for 130k steps); no text prompt is used at inference (ip_scale=1.0, DDIM solver, 50 denoising steps, guide_scale=7.5).
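For clarity, the inference settings I used, collected into one place (a hypothetical config dict; the key names are my own, not from any real codebase):

```python
# Hypothetical grouping of the inference settings described above.
inference_config = {
    "ip_scale": 1.0,            # full strength for the image-prompt branch
    "solver": "ddim",
    "num_denoising_steps": 50,
    "guide_scale": 7.5,
    "text_prompt": "",          # no text prompt at inference
}
```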
I would really appreciate it if you could share some suggestions on training with a customized backbone model, such as tips or findings about IP-Adapter module design, parameter tuning, or convergence behaviour during training. Or is there anything wrong with my setup?
Thanks a lot!