tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Training an IP-Adapter from scratch on my own model #218

Open ppj567 opened 6 months ago

ppj567 commented 6 months ago

Hi, thank you for your great work!

I tried to train an IP-Adapter on my own Stable-Diffusion-like backbone model (for my backbone: I slightly expanded SDXL's model size and then pretrained it thoroughly, so it is able to synthesize high-quality images). My settings: batch_size=250, img_size=1024x1024, lr=2e-5 (I found lr=1e-4 led to obvious artifacts in my case). The image projection module was modified into a two-layer MLP without LayerNorm, which still embeds the CLIP global image embedding into 4 tokens (I found that adding one more MLP layer and removing the LayerNorm improved image quality and sped up convergence in my case). My codebase is not built on diffusers, so I re-implemented IP-Adapter by adding W'_k and W'_v to the standard scaled-dot-product attention in each cross-attention layer and combining the newly generated hidden states X' with the original ones X at scale=1.0, before the output linear projection. In total, the IP-Adapter in my case has 61 million trainable parameters.
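For clarity, here is a minimal sketch of those two changes (the dimension values, e.g. clip_dim=1280 and cross_attention_dim=2048, and the module names are illustrative assumptions, not my exact code):

```python
import torch.nn as nn
import torch.nn.functional as F

class ImageProjMLP(nn.Module):
    """Two-layer MLP without LayerNorm that maps the CLIP global image
    embedding to 4 extra context tokens."""
    def __init__(self, clip_dim=1280, cross_attention_dim=2048, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, clip_dim * 2),
            nn.GELU(),
            nn.Linear(clip_dim * 2, num_tokens * cross_attention_dim),
        )

    def forward(self, image_embeds):                    # (B, clip_dim)
        tokens = self.proj(image_embeds)
        return tokens.view(-1, self.num_tokens, self.cross_attention_dim)

class DecoupledCrossAttention(nn.Module):
    """Standard cross-attention plus a decoupled image branch:
    only to_k_ip / to_v_ip (W'_k, W'_v) are newly trainable."""
    def __init__(self, query_dim, context_dim, heads=8, ip_scale=1.0):
        super().__init__()
        self.heads, self.ip_scale = heads, ip_scale
        self.to_q = nn.Linear(query_dim, query_dim, bias=False)
        self.to_k = nn.Linear(context_dim, query_dim, bias=False)
        self.to_v = nn.Linear(context_dim, query_dim, bias=False)
        self.to_k_ip = nn.Linear(context_dim, query_dim, bias=False)  # W'_k
        self.to_v_ip = nn.Linear(context_dim, query_dim, bias=False)  # W'_v
        self.to_out = nn.Linear(query_dim, query_dim)

    def forward(self, x, text_ctx, image_ctx):
        b, n, d = x.shape
        h = self.heads

        def split(t):                        # (B, L, D) -> (B, h, L, D/h)
            return t.view(b, -1, h, d // h).transpose(1, 2)

        q = split(self.to_q(x))
        # Text branch: frozen during adapter training.
        out = F.scaled_dot_product_attention(
            q, split(self.to_k(text_ctx)), split(self.to_v(text_ctx)))
        # Image branch: the new projections attend over the 4 image tokens.
        out_ip = F.scaled_dot_product_attention(
            q, split(self.to_k_ip(image_ctx)), split(self.to_v_ip(image_ctx)))
        # Combine X + scale * X' before the output linear projection.
        out = (out + self.ip_scale * out_ip).transpose(1, 2).reshape(b, n, d)
        return self.to_out(out)
```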

During training, I found the convergence of the IP-Adapter was really slow, even slower than training my whole backbone model from scratch. Moreover, it still struggled to synthesize a highly similar variant of the input image, and more training steps (e.g., 10k additional steps) changed little in my tests.

After training for about 130k steps, the results were not as good as the official IP-Adapter's. I know the official version was trained for 1000k steps, but I also noticed you mentioned in other issues that training for about 2-3 days (i.e., about 200k-300k steps) can achieve good results. Below are my re-implemented results (trained on image-text pairs for 130k steps); we use no text prompt at the inference stage (ip_scale=1.0, DDIM solver, 50 denoising steps, guide_scale=7.5).

I would really appreciate it if you could share some suggestions on training on a customized backbone model, such as tips or findings on IP-Adapter module design, parameter tuning, or convergence behaviour during training. Or is there anything wrong in my practice?

Thanks a lot!

[6 result images attached]

xiaohu2015 commented 6 months ago

How much training data do you have?

ppj567 commented 6 months ago

We use a much larger training set than the one in the original paper.

ppj567 commented 6 months ago

> How much training data do you have?

Could a large training set hinder the convergence of the IP-Adapter, given that it has only a few trainable parameters compared with the backbone model?

xiaohu2015 commented 6 months ago

1. In fact, for a picture like Girl with a Pearl Earring, the model should be able to overfit, because training datasets often contain this image.
2. For a fair comparison, you can use the same dataset and settings to train an IP-Adapter on SD 1.5 or SDXL; if that works, I think your implementation is almost right (see the sketch below).
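For reference, a sanity check with this repo's released SDXL adapter could look roughly like this; I am sketching from memory, so check ip_adapter/ip_adapter.py for the exact generate() signature and the model card for the checkpoint paths:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from PIL import Image
from ip_adapter import IPAdapterXL

# Load the base SDXL pipeline and wrap it with the pretrained adapter.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    add_watermarker=False,
)
ip_model = IPAdapterXL(
    pipe,
    "sdxl_models/image_encoder",        # CLIP image encoder weights
    "sdxl_models/ip-adapter_sdxl.bin",  # adapter checkpoint
    "cuda",
)

# Same settings as the test above: image prompt only, scale=1.0.
image = Image.open("girl_with_a_pearl_earring.png")
images = ip_model.generate(
    pil_image=image, num_samples=4, num_inference_steps=50, scale=1.0, seed=42
)
```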

ppj567 commented 6 months ago

Hi, I tried setting ip_scale=0 and feeding my backbone model the exact name of this famous image, "Girl with a Pearl Earring", as the text prompt. Below are the results from two runs. So I think the problem may not lie here.

[2 result images attached]

ppj567 commented 6 months ago

Or is this just the normal image-prompt adherence for an IP-Adapter that has been trained for 130k steps?

xiaohu2015 commented 6 months ago

Hi, the IP-Adapter paper gives some examples trained for about 200k steps (Ablation Study, Section 4.4.1); you can refer to those.

xiaohu2015 commented 6 months ago

> Hi, I tried setting ip_scale=0 and feeding my backbone model the exact name of this famous image, "Girl with a Pearl Earring", as the text prompt. Below are the results from two runs. So I think the problem may not lie here.
>
> [2 result images attached]

It seems your base model performs worse?

ppj567 commented 6 months ago

I am wondering whether a worse base model would lead to worse IP-Adapter image-prompt adherence.

ppj567 commented 6 months ago

> Hi, the IP-Adapter paper gives some examples trained for about 200k steps (Ablation Study, Section 4.4.1); you can refer to those.

The image-prompt adherence seems acceptable at 200k steps, better than my version.

xiaohu2015 commented 6 months ago

> I am wondering whether a worse base model would lead to worse IP-Adapter image-prompt adherence.

It should, as the IP-Adapter is only a lightweight adapter.

ppj567 commented 6 months ago

> > I am wondering whether a worse base model would lead to worse IP-Adapter image-prompt adherence.
>
> It should, as the IP-Adapter is only a lightweight adapter.

You mean text adherence would affect image adherence, regardless of further training of the adapter?

xiaohu2015 commented 6 months ago

What I mean is that the capability of the base model will be a limitation. The most obvious example is that if you apply IP-Adapter to a better model (such as a fine-tuned model from Civitai), the generation quality improves to a certain extent.

ppj567 commented 6 months ago

The most important thing I currently care about is the similarity between the generated image and the input image prompt, but so far my results are not satisfying in this respect. Maybe I should move to a better base model to achieve this goal, as suggested.

songyang86 commented 6 months ago

@ppj567 can you send me a small training dataset? I don't know the exact structure of the training data. Thanks

xiaohu2015 commented 6 months ago

A JSON file containing many text-image pairs:

[{"image_file": "1.png", "text": "A dog"}]

songyang86 commented 6 months ago

@xiaohu2015 dear xiaohu2015, can you upload a small training dataset, so I can train on my own data from scratch? Thanks

songyang86 commented 6 months ago

@xiaohu2015 I want to train IP-Adapter on the FFHQ dataset. After I detect and crop faces from the original images, how do I generate faceid.bin? Thanks
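My current guess is something like the following with insightface, but I am not sure whether this matches the expected faceid.bin format:

```python
import cv2
import torch
from insightface.app import FaceAnalysis

# Guess only: extract one ArcFace embedding per cropped FFHQ image and
# save the whole mapping; the real faceid.bin layout may differ.
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

embeddings = {}
for image_file in ["00000.png", "00001.png"]:   # my cropped face images
    img = cv2.imread(image_file)                # BGR, as insightface expects
    faces = app.get(img)
    if faces:
        # 512-d normalized embedding of the first detected face
        embeddings[image_file] = torch.tensor(faces[0].normed_embedding)

torch.save(embeddings, "faceid.bin")
```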

filliptm commented 5 months ago

so my dataset structure for training an IP-Adapter is simply a text caption and an image pairing?

xiaohu2015 commented 5 months ago

> so my dataset structure for training an IP-Adapter is simply a text caption and an image pairing?

yes