tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Performance on other models based on sd1.5, and the training dataset #19

Open Laidawang opened 1 year ago

Laidawang commented 1 year ago

This project is awesome!!! I have two small questions:

  1. Have you tried it on other models trained on sd1.5 (like Realistic 2.0 or anime models)? How is the performance?
  2. I don't know how to structure my training data; could you show a small example?
xiaohu2015 commented 1 year ago

@Laidawang hi, the IP-Adapter only needs to be trained on sd1.5, but it can be used with most community models. For training, you need to prepare image-text pairs and convert the data into a json file:

[
      {"text": "A dog", "image_file": "dog.jpg"},
      {"text": "A cat", "image_file": "cat.jpg"}

]
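
For illustration, here is a minimal sketch of how such a json file could be loaded as a dataset (the class and field names are just an example, not the exact training script):

```python
import json
import os

from PIL import Image
from torch.utils.data import Dataset


class ImageTextDataset(Dataset):
    """Loads the image-text pairs described in the json file above."""

    def __init__(self, json_path, image_root):
        with open(json_path) as f:
            self.items = json.load(f)
        self.image_root = image_root

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        item = self.items[idx]
        image_path = os.path.join(self.image_root, item["image_file"])
        image = Image.open(image_path).convert("RGB")
        return {"text": item["text"], "image": image}
```
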
Laidawang commented 1 year ago

@xiaohu2015 , thank you for your help. So we use this image as both the input to CLIP and the ground truth; will this limit the variety of the image embedding? Because in my experiment, when the scale is high (0.9 or higher), it basically restores the input image completely, but when it is low (0.3), it produces some empty scenes. I'm trying to use inpainting to create a background for some small objects with this technique.

xiaohu2015 commented 1 year ago

@Laidawang you may adjust the scale and add some text prompts to get good results. For now, we just use the same image as the condition and the ground truth, which may limit its generation ability. In addition, we are also exploring better solutions.
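
For example, roughly following the demo notebooks in this repo (paths and exact arguments may need adjusting for your setup):

```python
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

from ip_adapter import IPAdapter

# any sd1.5-based community model should work as the base pipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
ip_model = IPAdapter(pipe, "models/image_encoder", "models/ip-adapter_sd15.bin", "cuda")

image = Image.open("object.jpg")
# a lower scale plus a text prompt lets the text control more of the scene
images = ip_model.generate(
    pil_image=image,
    prompt="a small object on a wooden table, soft natural light",
    scale=0.6,
    num_samples=1,
    num_inference_steps=50,
)
images[0].save("result.png")
```
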

Laidawang commented 1 year ago

@xiaohu2015 I think you use a semantically consistent prompt and image during training, which will cause problems when the input image and the prompt are semantically inconsistent. Maybe we could try training like this, for example: prompt: a cat, image: an empty scene, GT: a cat in that scene. Or vice versa: prompt: describes the scene, image: a cat, GT: the cat in the scene. I think this would separate the influence of the prompt and the input image at the embedding level.

xiaohu2015 commented 1 year ago

@Laidawang you are right, but building such a dataset needs a certain amount of work; of course it would make the IP-Adapter more powerful (in fact, that is in our plan). By the way, we have trained an IP-Adapter which uses a face image as the image prompt (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-plus-face_demo.ipynb). During training, we use the face as the image condition, but the full image is the GT.

Laidawang commented 1 year ago

wow, that's really nice

Laidawang commented 1 year ago

> @Laidawang you are right, but building such a dataset needs a certain amount of work; of course it would make the IP-Adapter more powerful (in fact, that is in our plan). By the way, we have trained an IP-Adapter which uses a face image as the image prompt (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-plus-face_demo.ipynb). During training, we use the face as the image condition, but the full image is the GT.

In that case, how should I build the dataset? Can you give an example?

xiaohu2015 commented 1 year ago

@Laidawang you can detect the face in the image, and crop it.
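
For example, a rough sketch using OpenCV's bundled face detector (any face detector would do; this is just an illustration, not necessarily the pipeline used for the released model):

```python
import cv2


def crop_face(image_path, margin=0.2):
    """Detect the largest face in the image and return a crop with some margin."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])  # keep the largest face
    dx, dy = int(w * margin), int(h * margin)
    return image[max(0, y - dy): y + h + dy, max(0, x - dx): x + w + dx]
```
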

hkunzhe commented 1 year ago

> @Laidawang you are right, but building such a dataset needs a certain amount of work; of course it would make the IP-Adapter more powerful (in fact, that is in our plan). By the way, we have trained an IP-Adapter which uses a face image as the image prompt (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-plus-face_demo.ipynb). During training, we use the face as the image condition, but the full image is the GT.

That is to say, the batch["clip_image"] in the training script corresponds to the cropped image, and the batch["images"] corresponds to the full image?

xiaohu2015 commented 1 year ago


> That is to say, the batch["clip_image"] in the training script corresponds to the cropped image, and the batch["images"] corresponds to the full image?

yes.
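
For illustration, a dataset item could be built roughly like this (the key names follow the discussion above; a hypothetical sketch, not the exact training script):

```python
from PIL import Image
from torchvision import transforms
from transformers import CLIPImageProcessor

clip_image_processor = CLIPImageProcessor()
image_transform = transforms.Compose([
    transforms.Resize(512),
    transforms.CenterCrop(512),
    transforms.ToTensor(),
    transforms.Normalize([0.5], [0.5]),
])


def make_example(full_image_path, face_crop_path, text):
    full_image = Image.open(full_image_path).convert("RGB")
    face_image = Image.open(face_crop_path).convert("RGB")
    return {
        # the cropped face is the image condition fed to the CLIP image encoder
        "clip_image": clip_image_processor(images=face_image, return_tensors="pt").pixel_values[0],
        # the full image is the diffusion ground truth
        "images": image_transform(full_image),
        "text": text,
    }
```
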

hkunzhe commented 1 year ago


> That is to say, the batch["clip_image"] in the training script corresponds to the cropped image, and the batch["images"] corresponds to the full image?
>
> yes.

Thank you for such a quick reply! I've tried the model ip-adapter-plus-face_sd15.bin, and I find it's still hard to preserve human face likeness, as discussed in #5. Do you think it would be better to replace the original CLIP with a face-specific CLIP model like FaRL, or do you have a better suggestion?

xiaohu2015 commented 1 year ago

@hkunzhe I think you can give it a try. For CLIP models, I found that they can only learn the rough structure of the face. Hence, I think using a face-specific model is more promising. However, my early experiments using features from face recognition models did not work well; it is hard to train and learn using only the diffusion loss.

KevinChen880723 commented 12 months ago

@xiaohu2015 Thanks for your great work! Did you pre-train a face recognition model before training it with the diffusion model, or train them simultaneously? Could you briefly describe your previous experiments? Thanks a lot for your help in advance! Have a nice day :)

JasonSongPeng commented 11 months ago

> @Laidawang hi, the IP-Adapter only needs to be trained on sd1.5, but it can be used with most community models. For training, you need to prepare image-text pairs and convert the data into a json file:
>
>     [
>           {"text": "A dog", "image_file": "dog.jpg"},
>           {"text": "A cat", "image_file": "cat.jpg"}
>     ]

Dear xiaohu,

May I ask one question about the json file of training data? Is the 'text' similar to the captions we use when training a LoRA model? I mean, if there are many elements in my images, such as a table, chair, carpet, etc., how should I prepare the 'text'?

Looking forward to your reply. Best,

ALR-alr commented 2 months ago

So it means the IP-Adapter was trained with the "same image as condition and ground truth", but at inference time we can use a strawberry-shaped cropped image together with a landscape to generate a strawberry-shaped mountain (as shown in the README.md)?