tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

IP-Adapter as Viton? #261

Open Bilal143260 opened 10 months ago

Bilal143260 commented 10 months ago

Is it a good idea to train IP-Adapter to act like a viton (virtual try-on)? The training data would include images of cloth and prompts as input and a model wearing that dress as ground truth.

xiaohu2015 commented 10 months ago

I think it should work; combining it with a ControlNet would probably be even better.

Bilal143260 commented 10 months ago

Thanks for the reply @xiaohu2015. Could you please provide any reference material for training IP-Adapter together with ControlNet?

xiaohu2015 commented 10 months ago

you can refer to https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py + https://github.com/tencent-ailab/IP-Adapter/blob/main/tutorial_train.py
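The core idea behind merging those two scripts can be sketched as a single training step that runs both conditioning branches and optimizes them jointly. The sketch below is hypothetical and not the repo's actual code: tiny stand-in modules replace diffusers' `UNet2DConditionModel` and `ControlNetModel` so it runs self-contained, and all module names and dimensions are made up for illustration.

```python
# Hypothetical sketch of combining train_controlnet.py with tutorial_train.py:
# one forward pass uses both the image-condition tokens (IP-Adapter branch)
# and the spatial residuals (ControlNet branch), with ONE optimizer over both.
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    """Stand-in for the IP-Adapter image-projection layers."""
    def __init__(self, img_dim=16, ctx_dim=32, tokens=4):
        super().__init__()
        self.proj = nn.Linear(img_dim, ctx_dim * tokens)
        self.tokens, self.ctx_dim = tokens, ctx_dim
    def forward(self, image_embeds):
        b = image_embeds.shape[0]
        return self.proj(image_embeds).view(b, self.tokens, self.ctx_dim)

class TinyControlNet(nn.Module):
    """Stand-in for a ControlNet mapping a spatial condition to residuals."""
    def __init__(self, ch=3):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, cond):
        return self.conv(cond)

class TinyUNet(nn.Module):
    """Stand-in denoiser: noisy latents + context tokens + residuals in."""
    def __init__(self, ch=3, ctx_dim=32):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.ctx = nn.Linear(ctx_dim, ch)
    def forward(self, x, context, residual):
        bias = self.ctx(context.mean(dim=1))[:, :, None, None]
        return self.conv(x + residual) + bias

adapter, controlnet, unet = TinyAdapter(), TinyControlNet(), TinyUNet()
unet.requires_grad_(False)  # the base UNet stays frozen, as in both scripts

# the key point of "training together": one optimizer over both branches
optimizer = torch.optim.AdamW(
    list(adapter.parameters()) + list(controlnet.parameters()), lr=1e-4
)

noisy_latents = torch.randn(2, 3, 8, 8)   # dummy batch
pose_cond     = torch.randn(2, 3, 8, 8)   # ControlNet condition (e.g. pose)
cloth_embeds  = torch.randn(2, 16)        # cloth features (e.g. from DINO)
target_noise  = torch.randn(2, 3, 8, 8)

context  = adapter(cloth_embeds)          # image tokens for cross-attention
residual = controlnet(pose_cond)          # spatial guidance residuals
pred     = unet(noisy_latents, context, residual)
loss     = nn.functional.mse_loss(pred, target_noise)
loss.backward()
optimizer.step()
```

In a real run, the denoising loss and noise scheduler come from `tutorial_train.py`, while the condition-image dataloading and ControlNet forward come from `train_controlnet.py`; the merge is essentially the shared optimizer shown above.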

Bilal143260 commented 10 months ago

For reference, I tried training InstructPix2Pix, but that didn't work out. I also trained an OpenPose ControlNet, but it doesn't capture the details of the dresses. Now I have to figure out how to combine your training script with the ControlNet training script. Any collaboration, suggestions, or even paid consultancy would be highly appreciated.

xiaohu2015 commented 10 months ago

> For reference, I tried training InstructPix2Pix, but that didn't work out. I also trained an OpenPose ControlNet, but it doesn't capture the details of the dresses. Now I have to figure out how to combine your training script with the ControlNet training script. Any collaboration, suggestions, or even paid consultancy would be highly appreciated.

A good start would be an OpenPose ControlNet (but with the condition replaced by an image condition) + IP-Adapter (with the image condition).

For the cloth image condition, I think you can use DINO to extract image features.
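One hypothetical way to wire DINO features into the adapter: extract patch features from the garment image (e.g. with a DINOv2 backbone such as the `facebook/dinov2-base` checkpoint, hidden size 768, via the `transformers` library) and project them into a fixed number of cross-attention tokens. The module below is a made-up illustration, not the repo's code; random tensors stand in for real DINOv2 features so it runs self-contained.

```python
# Hypothetical sketch: project DINO cloth-patch features into the
# cross-attention tokens that the IP-Adapter would normally build
# from CLIP image embeddings.
import torch
import torch.nn as nn

class ClothImageProj(nn.Module):
    """Pool a variable number of DINO patch features into N fixed tokens."""
    def __init__(self, dino_dim=768, cross_attn_dim=768, num_tokens=16):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(dino_dim, cross_attn_dim),
            nn.GELU(),
            nn.Linear(cross_attn_dim, cross_attn_dim),
        )
        # learned queries attend over the patches to form the tokens
        self.queries = nn.Parameter(torch.randn(num_tokens, cross_attn_dim))

    def forward(self, patch_feats):            # (B, num_patches, dino_dim)
        feats = self.proj(patch_feats)         # (B, num_patches, D)
        attn = torch.softmax(self.queries @ feats.transpose(1, 2), dim=-1)
        return attn @ feats                    # (B, num_tokens, D)

proj = ClothImageProj()
dummy_patches = torch.randn(2, 256, 768)       # e.g. a 16x16 DINOv2 patch grid
tokens = proj(dummy_patches)                   # (2, 16, 768) adapter tokens
```

The point of keeping *patch* features rather than a single pooled vector is that fine garment details (prints, seams, textures) live in the spatial tokens, which is exactly what a try-on task needs.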

Bilal143260 commented 10 months ago

So what you mean is: I train Stable Diffusion with the IP-Adapter (on a VITON dataset), and then during inference I use the OpenPose ControlNet and add the cloth image condition to it (maybe a clothing mask from DINO, etc.) + the IP-Adapter. Correct me if I am wrong?

xiaohu2015 commented 10 months ago

> So what you mean is: I train Stable Diffusion with the IP-Adapter (on a VITON dataset), and then during inference I use the OpenPose ControlNet and add the cloth image condition to it (maybe a clothing mask from DINO, etc.) + the IP-Adapter. Correct me if I am wrong?

I mean you train the IP-Adapter + ControlNet together.

Bilal143260 commented 10 months ago

Thank you for all suggestions. I will give it a shot.

dxposmovon commented 10 months ago

>> So what you mean is: I train Stable Diffusion with the IP-Adapter (on a VITON dataset), and then during inference I use the OpenPose ControlNet and add the cloth image condition to it (maybe a clothing mask from DINO, etc.) + the IP-Adapter. Correct me if I am wrong?
>
> I mean you train the IP-Adapter + ControlNet together.

Do you mean training an IP-Adapter and a new OpenPose ControlNet at the same time, or just training the IP-Adapter while keeping the ControlNet fixed?

xiaohu2015 commented 10 months ago

Training them together.
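The distinction dxposmovon is asking about comes down to which parameters the optimizer sees. A minimal sketch of the two options, assuming `adapter` and `controlnet` are ordinary `nn.Module`s (the linear layers are hypothetical stand-ins):

```python
# Hypothetical sketch: joint training vs. training the adapter only.
import torch
import torch.nn as nn

adapter = nn.Linear(8, 8)      # stand-in for the IP-Adapter modules
controlnet = nn.Linear(8, 8)   # stand-in for the OpenPose ControlNet

# Option A (what xiaohu2015 recommends): train both together
joint_opt = torch.optim.AdamW(
    list(adapter.parameters()) + list(controlnet.parameters()), lr=1e-4
)

# Option B: freeze the ControlNet and optimize only the adapter
controlnet.requires_grad_(False)
adapter_only_opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```

Training them together lets the ControlNet branch adapt to the image condition (pose/agnostic image) at the same time as the adapter learns the cloth condition, rather than forcing the adapter to work around a ControlNet trained for a different condition.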

hosjiu1702 commented 4 months ago

>> For reference, I tried training InstructPix2Pix, but that didn't work out. I also trained an OpenPose ControlNet, but it doesn't capture the details of the dresses. Now I have to figure out how to combine your training script with the ControlNet training script. Any collaboration, suggestions, or even paid consultancy would be highly appreciated.
>
> A good start would be an OpenPose ControlNet (but with the condition replaced by an image condition) + IP-Adapter (with the image condition).
>
> For the cloth image condition, I think you can use DINO to extract image features.

If so, should we change the image encoder of the IP-Adapter from CLIP to, for instance, DINOv2? @xiaohu2015, like this?