tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

ip-adapter-full-face_sd15.bin details #140

Open whiterose199187 opened 1 year ago

whiterose199187 commented 1 year ago

Hello,

I see ip-adapter-full-face_sd15.bin has been recently released. Could you explain what is the difference between this and previously released version of IP-Adapter-Face? Also, is this just for SD 1.5 or can work with SDXL too?

Thanks

xiaohu2015 commented 1 year ago

IP-Adapter should be universal, not limited to human faces, for example, it can be used for clothing https://github.com/tencent-ailab/IP-Adapter/pull/135#issuecomment-1803437109

whiterose199187 commented 1 year ago

can it be used with SDXL ?

xiaohu2015 commented 1 year ago

not available

whiterose199187 commented 1 year ago

Thanks for the quick response. Are there any plans to release an SDXL version in the future?

ShungJhon commented 1 year ago

I replaced ip-adapter-plus-face_sd15.pth with ip-adapter-full-face_sd15.pth in the webui and it throws an error. Any idea why? (Previously, simply renaming the .bin to .pth worked.)

[screenshot of the webui error, 2023-11-10]
xiaohu2015 commented 1 year ago

It is currently not supported.

eezywu commented 1 year ago

Hello, I see ip-adapter-full-face_sd15.bin has been recently released. Could you explain what is the difference between this and previously released version of IP-Adapter-Face? Also, is this just for SD 1.5 or can work with SDXL too? Thanks

  • data: we remove some small faces and do some crop augmentations.
  • data preprocessing: we segment the face and remove the background.
  • model: we use full tokens (256 patch tokens + 1 cls token) and a simple MLP to get face features.

IP-Adapter should be universal, not limited to human faces, for example, it can be used for clothing #135 (comment)

@xiaohu2015 Hello, I have two questions about these modifications.

Thanks!
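(For illustration only, a minimal sketch of the data-preprocessing steps described in the bullets above: crop the face with some margin and remove the background. The margin, output size, and the use of rembg are my assumptions; the repo does not state which tools were used.)

from PIL import Image
from rembg import remove  # assumed background-removal tool; not confirmed by the authors

def crop_and_segment_face(image: Image.Image, bbox, margin=0.4, size=224):
    """Crop a face with some margin and paste the segmented face onto a white background.

    bbox is (x1, y1, x2, y2) from any face detector; margin and size are illustrative."""
    x1, y1, x2, y2 = bbox
    w, h = x2 - x1, y2 - y1
    # Expand the detected box so hair and chin are not cut off, then clamp to the image.
    x1 = max(0, int(x1 - margin * w)); y1 = max(0, int(y1 - margin * h))
    x2 = min(image.width, int(x2 + margin * w)); y2 = min(image.height, int(y2 + margin * h))
    face = image.crop((x1, y1, x2, y2)).resize((size, size))
    segmented = remove(face)                               # RGBA, background made transparent
    white = Image.new("RGB", segmented.size, (255, 255, 255))
    white.paste(segmented, mask=segmented.split()[-1])     # keep only the face region
    return white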

xiaohu2015 commented 1 year ago

@eezywu (1) No, we only remove the background. But I also trained a model conditioned only on the segmented face (no hair), and it also works well. (2) The new version consistently gets better results (we use face ID similarity to evaluate).
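(To make the evaluation criterion above concrete: a minimal sketch of measuring face ID similarity between a reference face and a generated face, assuming insightface's ArcFace embeddings. The library choice is mine; the repo does not say which face recognition model was used for evaluation.)

import numpy as np
import cv2
from insightface.app import FaceAnalysis  # assumed embedding model

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def face_id_similarity(path_a: str, path_b: str) -> float:
    """Cosine similarity between the ID embeddings of the largest face in each image."""
    embs = []
    for path in (path_a, path_b):
        faces = app.get(cv2.imread(path))  # BGR image, as insightface expects
        if not faces:
            raise ValueError(f"no face detected in {path}")
        face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
        embs.append(face.normed_embedding)  # already L2-normalized, 512-d
    return float(np.dot(embs[0], embs[1]))

# e.g. face_id_similarity("reference.png", "generated.png") -> value roughly in [-1, 1]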

yaoyuan13 commented 1 year ago

Hi, I saw that the generation setting for plus-face uses a non-square size, i.e., height 704 and width 512. Did you train the model with this output size, or still at 512x512?

xiaohu2015 commented 1 year ago

I trained the model on SD 1.5 at a fixed 512x512 resolution. But if the base UNet model can generate non-square images, the adapter also works well at those sizes. By the way, I also tried fine-tuning the model with multi-scale training, but it brought no improvement.
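(As a usage illustration, a hedged sketch of generating at 512x704 with the released face adapter. It assumes the demo-style setup from this repo; see ip_adapter-full-face_demo.ipynb for the exact arguments. Extra keyword arguments such as width/height are forwarded to the underlying diffusers pipeline, which is why non-square sizes work even though training was at 512x512.)

import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler
from PIL import Image
from ip_adapter import IPAdapterFull  # wrapper used in the full-face demo notebook

base_model = "runwayml/stable-diffusion-v1-5"
pipe = StableDiffusionPipeline.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    scheduler=DDIMScheduler.from_pretrained(base_model, subfolder="scheduler"),
    feature_extractor=None,
    safety_checker=None,
)

ip_model = IPAdapterFull(
    pipe,
    "models/image_encoder",                  # OpenCLIP ViT-H/14 image encoder
    "models/ip-adapter-full-face_sd15.bin",
    "cuda",
    num_tokens=257,                          # 256 patch tokens + 1 cls token
)

face = Image.open("face.png")
images = ip_model.generate(
    pil_image=face,
    prompt="a photo of a person in a garden, best quality",
    num_samples=1,
    num_inference_steps=30,
    seed=42,
    width=512,    # trained at 512x512; non-square sizes usually still work
    height=704,   # because the base UNet handles them (see the reply above)
)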

yaoyuan13 commented 1 year ago

Got it, thanks.

eezywu commented 1 year ago

got it, thanks for your reply :)

h3clikejava commented 1 year ago

@xiaohu2015 Can you share how "use full tokens and a simple MLP to get face features" was achieved? Is this part not open-sourced yet? My tests found that full-face performs much better than plus-face. I then tried training ip-adapter-face with my own data, cropping the face and removing the background, and it indeed works better than the general full-face approach. However, I would like to try modifying the model the way you did to further improve the results. Thx

xiaohu2015 commented 1 year ago

The training code is basically the same as https://github.com/tencent-ailab/IP-Adapter/blob/main/tutorial_train_plus.py. Only two changes: (1) the conditioning image is the face image, (2) the ImageProj module is switched to an MLP.
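(For readers asking how the MLP projection looks in practice: a minimal sketch, assuming the OpenCLIP ViT-H/14 image encoder (257 tokens of width 1280) and SD 1.5's cross-attention dimension of 768. The class name and layer layout are illustrative and may differ from the released MLPProjModel.)

import torch.nn as nn

class FullTokenMLPProj(nn.Module):
    """Illustrative ImageProj replacement: feed all CLIP vision tokens
    (256 patch tokens + 1 cls token) through a simple MLP."""

    def __init__(self, cross_attention_dim=768, clip_embeddings_dim=1280):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_embeddings_dim, clip_embeddings_dim),
            nn.GELU(),
            nn.Linear(clip_embeddings_dim, cross_attention_dim),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, image_embeds):
        # image_embeds: (batch, 257, clip_embeddings_dim) from the CLIP vision encoder
        return self.norm(self.proj(image_embeds))  # (batch, 257, cross_attention_dim)

Compared with the Resampler used by the plus variants, this keeps every image token instead of compressing them into a small set of learned queries, so the adapter cross-attends over 257 face tokens.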

Laidawang commented 12 months ago

I would like to ask additionally: for faces, how large an area do we need to crop, and which face detection model do you recommend? @xiaohu2015

xiaohu2015 commented 12 months ago

https://github.com/tencent-ailab/IP-Adapter/issues/54

Nuyoah13 commented 11 months ago

I have a question about how you define the prompt when training the face model. Do you use a detailed prompt describing the target image, or just a simple prompt like 'a person'? I found that generation quality at inference degrades when we use detailed prompts.

xiaohu2015 commented 11 months ago

I use detailed prompts.

americanexplorer13 commented 11 months ago

Is it possible to use IP-Adapter face embeddings for checking similarity between two faces?

zzzzzero commented 11 months ago

I attempted to replicate the training process of ip-adapter-full-face, but I encountered failure. For training data, I used 750,000 image-text pairs selected from LAION-Face. I performed facial cropping and alignment on these 750,000 face images, producing 224x224 face crops. I used the cropped face images as input for CLIP, with the corresponding text descriptions from the LAION-Face dataset as the text condition. The training goal was to reconstruct the original images.

I set the training steps to 1 million, saved results every 10,000 steps, the SD model is 1.5, and I trained on a machine with 8 V100 32GB GPUs. However, when testing with scale=1.0, I found that the trained model cannot effectively maintain identity. Additionally, there is a noticeable change in facial structure every 10,000 steps, and the generated faces differ significantly from the input. I wonder if I missed some training details.

My training code is based on tutorial_train_plus with only the following modifications:

# ip-adapter-plus
clip_image = self.clip_image_processor(images=raw_image, return_tensors="pt").pixel_values

# ip-adapter-full
face_image_file = item["face_image_file"]
face_image = Image.open(os.path.join(self.image_root_path, face_image_file))
clip_image = self.clip_image_processor(images=face_image, return_tensors="pt").pixel_values

# ip-adapter-plus
image_proj_model = Resampler(
    dim=unet.config.cross_attention_dim,
    depth=4,
    dim_head=64,
    heads=12,
    num_queries=args.num_tokens,
    embedding_dim=image_encoder.config.hidden_size,
    output_dim=unet.config.cross_attention_dim,
    ff_mult=4,
)

# ip-adapter-full
image_proj_model = MLPProjModel(
    cross_attention_dim=unet.config.cross_attention_dim,
    clip_embeddings_dim=image_encoder.config.hidden_size,
)


xiaohu2015 commented 11 months ago

Did you test my face demo? Does it perform better than your model?

zzzzzero commented 11 months ago

I tested the results, and the face identities generated by the model at every 10,000 steps are inconsistent, so I believe my training has failed. Additionally, when I set the text prompt to an empty string and input the cropped face images, some of the generated images do not contain faces at all.

xiaohu2015 commented 11 months ago

I think you can make a comparison with my model https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter-full-face_demo.ipynb

zzzzzero commented 11 months ago

Sorry, I just discovered a bug in my code where I applied the transform to the face images twice. By the way, could you confirm whether my data processing flow and training details are correct?

zzzzzero commented 11 months ago

Initially, I only used cropped face images with an empty text prompt, attempting to reconstruct the cropped faces solely from the image features extracted by CLIP. However, I found that as training progressed, the reconstruction quality improved at first but then deteriorated; it was consistently difficult to fully preserve identity features, and only intermediate checkpoints showed relatively high similarity. Have you conducted similar experiments before?

xiaohu2015 commented 11 months ago

I think that only reconstructing the cropped faces is not very meaningful. To improve face consistency, you can also condition on the ID embedding from a face recognition model; I found it very helpful.
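(To make that last suggestion concrete: a hedged sketch of extracting a face ID embedding with insightface and projecting it to a few extra cross-attention tokens. The projection module and token count are illustrative assumptions, not code from this repo.)

import numpy as np
import torch
import torch.nn as nn
from insightface.app import FaceAnalysis  # ArcFace-based ID embedding (assumed choice)

app = FaceAnalysis(name="buffalo_l")
app.prepare(ctx_id=0, det_size=(640, 640))

def get_id_embedding(bgr_image: np.ndarray) -> torch.Tensor:
    """Return the 512-d normalized ID embedding of the largest detected face."""
    faces = app.get(bgr_image)
    if not faces:
        raise ValueError("no face detected")
    face = max(faces, key=lambda f: (f.bbox[2] - f.bbox[0]) * (f.bbox[3] - f.bbox[1]))
    return torch.from_numpy(face.normed_embedding).unsqueeze(0)  # (1, 512)

class IDProj(nn.Module):
    """Illustrative: project one ID embedding to a few cross-attention tokens."""

    def __init__(self, id_dim=512, cross_attention_dim=768, num_tokens=4):
        super().__init__()
        self.num_tokens = num_tokens
        self.cross_attention_dim = cross_attention_dim
        self.proj = nn.Sequential(
            nn.Linear(id_dim, cross_attention_dim * num_tokens),
            nn.GELU(),
            nn.Linear(cross_attention_dim * num_tokens, cross_attention_dim * num_tokens),
        )
        self.norm = nn.LayerNorm(cross_attention_dim)

    def forward(self, id_embeds):  # (batch, 512)
        tokens = self.proj(id_embeds).reshape(-1, self.num_tokens, self.cross_attention_dim)
        return self.norm(tokens)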