tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.

cloth-id training #322

Open · Logos23333 opened 6 months ago

Logos23333 commented 6 months ago

My understanding is that the input for the face-id model is a triplet (face, text prompt, ground-truth image), where the face is a small crop containing only the face, extracted from the ground-truth image, so that the model can generate images of the specified face according to the text prompt. I now want to train a cloth-id model in a similar manner: given an image of a piece of clothing, the model would generate images containing the specified clothing. My questions are:

  1. I would like to train on top of SDXL. Should I use the tutorial_train_sdxl.py script? I see there is also a face-id script (tutorial_train_faceid), and I am not clear whether the training method for face-id differs from the regular IP-Adapter. Which script should I use for training?
  2. Regarding the cloth images, I used cloth-segmentation to segment the original image. Should I use image 1, image 2, or some other type of image as the adapter input? [original image, image 1, and image 2 were attached; see the sketch after this list]
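A minimal sketch of two common preprocessing variants for the segmented cloth (a tight crop, and a crop with the background masked out). The file names are hypothetical, and whether these variants correspond to the attached image 1 and image 2 is an assumption:

```python
import numpy as np
from PIL import Image

# Hypothetical inputs: the original photo and a binary garment mask
# produced by cloth-segmentation (white = garment, black = background).
original = Image.open("original.jpg").convert("RGB")
mask = np.array(Image.open("cloth_mask.png").convert("L")) > 127

# Tight bounding box around the garment pixels.
ys, xs = np.where(mask)
box = (int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1)

# Variant A: crop only, background kept inside the box.
crop_only = original.crop(box)

# Variant B: same crop, but everything outside the mask set to white.
arr = np.array(original).copy()
arr[~mask] = 255
crop_masked = Image.fromarray(arr).crop(box)

crop_only.save("cloth_crop.png")
crop_masked.save("cloth_crop_masked.png")
```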
xiaohu2015 commented 6 months ago

I think you can use the regular IP-Adapter (plus model) for training.

Logos23333 commented 6 months ago

> plus model

Any suggestions about the image input for the IP-Adapter? Should I use the original image or the cropped image?

xiaohu2015 commented 6 months ago

You should first select an image encoder; I think you can use CLIP or DINO. You can resize the cloth image to 224x224.
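A minimal sketch of extracting cloth embeddings with a CLIP image encoder via the transformers library, following the suggestion above. The checkpoint name is an assumption and the input file is hypothetical, so treat this only as an illustration of the interface:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModelWithProjection

# Hypothetical encoder choice; swap in whichever CLIP/DINO encoder you train with.
encoder_name = "openai/clip-vit-large-patch14"
image_encoder = CLIPVisionModelWithProjection.from_pretrained(encoder_name)
processor = CLIPImageProcessor()  # resizes/center-crops to 224x224 by default

cloth = Image.open("cloth_crop.png").convert("RGB")
pixel_values = processor(images=cloth, return_tensors="pt").pixel_values

with torch.no_grad():
    out = image_encoder(pixel_values, output_hidden_states=True)
    pooled_embed = out.image_embeds        # global image embedding (basic adapter)
    patch_tokens = out.hidden_states[-2]   # per-patch tokens (plus-style adapter)
```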

hongminpark commented 5 months ago

@Logos23333 Hi! Have you tried training? I want to train on my own clothes too.

Logos23333 commented 5 months ago

@hongminpark I attempted to train two IP-Adapters using tutorial_train_sdxl.py, once initializing the weights from scratch and once fine-tuning from the ip-adapter sdxl checkpoint (the image encoder is also taken from there). However, neither attempt yielded good results; the generated cloth lacks proper consistency. My experimental setup: pretrained_model_name_or_path=SG161222/RealVisXL_V4.0, resolution=512, learning_rate=1e-4, weight_decay=1e-2, num_train_epochs=10, train_batch_size=8, with a dataset of 40k images at 512x512 resolution. My questions:

  1. Should a different image encoder be used to extract the cloth embedding?
  2. Should the clothes be preprocessed, such as cropping the original image to retain only the clothes?
  3. Should I use a community checkpoint as the backbone, or fine-tune from the original SDXL?

@xiaohu2015 Could you take a look at my training setup to see if there are any issues, or provide some suggestions? I would be very grateful.