tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

How to change the expression when using the faceid model #214

Open guijuzhejiang opened 6 months ago

guijuzhejiang commented 6 months ago

Great job! I used the faceid model and the character is reproduced very well. But I also found a problem, probably caused by the face embedding: the expression of the reference person is preserved too faithfully. Even if I write different expression prompts, I can't change the expression. Is there any way to change the expression of the reference person?

xiaohu2015 commented 6 months ago

The faceid model should be able to change expressions; the faceid plus model maybe not. But you can try faceid plus v2 to achieve that (use a lower weight).
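
For example, a minimal sketch of running faceid plus v2 with a lower weight, following the repo's faceid demos (`pipe` is an already-loaded Stable Diffusion 1.5 pipeline; the checkpoint paths and file names are assumptions):

```python
import cv2
import torch
from insightface.app import FaceAnalysis
from insightface.utils import face_align
from ip_adapter.ip_adapter_faceid import IPAdapterFaceIDPlus

# extract the face embedding and an aligned face crop with insightface
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))
image = cv2.imread("person.jpg")  # reference face (hypothetical file name)
faces = app.get(image)
faceid_embeds = torch.from_numpy(faces[0].normed_embedding).unsqueeze(0)
face_image = face_align.norm_crop(image, landmark=faces[0].kps, image_size=224)

# attach faceid plus v2 to the existing pipeline
ip_model = IPAdapterFaceIDPlus(
    pipe,
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K",  # image encoder
    "ip-adapter-faceid-plusv2_sd15.bin",      # plus v2 checkpoint
    "cuda",
)

# shortcut=True selects the v2 behavior; a lower s_scale weakens the
# structural control so the text prompt can change the expression
images = ip_model.generate(
    prompt="portrait photo of a person laughing",
    face_image=face_image,
    faceid_embeds=faceid_embeds,
    shortcut=True,
    s_scale=0.5,
    num_samples=1,
    width=512,
    height=768,
    num_inference_steps=30,
)
```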

YZBPXX commented 6 months ago

The faceid model should be able to change expressions; the faceid plus model maybe not. But you can try faceid plus v2 to achieve that (use a lower weight).

I visualized the face token attention maps for the faceid and faceid_plus models. From the attention maps, it appears that faceid_plus controls the face more precisely. Could I merge the visualization code into the main branch? It would make it easy to visualize face token maps for all the models you release.

[Screenshots: face token attention map visualizations, 2024-01-03]

xiaohu2015 commented 6 months ago

thanks a lot

rafstahelin commented 6 months ago

How does one visualise the face embeddings? Best, Raf

YZBPXX commented 6 months ago

How does one visualise the face embeddings?

You can browse the recently updated code in the 'visual_attnmap.ipynb' notebook.

juancopi81 commented 6 months ago

This is great! Would the get_net_attn_map function work for ip_adapter_plus (without faceid)?

YZBPXX commented 6 months ago

This is great! Would the get_net_attn_map function work for ip_adapter_plus (without faceid)?

I just submitted a pull request. You only need to add the following three lines of code, used the same way as for faceid:

```python
pipe.unet = register_cross_attention_hook(pipe.unet)  # wrap the UNet's attention processors to record IP attention maps
attn_maps = get_net_attn_map((768, 512))              # aggregate the recorded maps at the output (height, width)
attn_hot = attnmaps2images(attn_maps)                 # turn the per-token maps into heatmap images
```
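
For context, a minimal usage sketch around those three lines; the pipeline call, prompt, resolution, and import path are placeholders, not from the PR:

```python
# the three helpers come from the PR; in the current repo they appear to
# live in ip_adapter/utils.py (path assumed)
from ip_adapter.utils import (register_cross_attention_hook,
                              get_net_attn_map, attnmaps2images)

# register the hook before generating so the maps are recorded during sampling
pipe.unet = register_cross_attention_hook(pipe.unet)

image = pipe(prompt="portrait photo of a person", height=768, width=512,
             num_inference_steps=30).images[0]

attn_maps = get_net_attn_map((768, 512))  # (height, width) of the generated image
attn_hot = attnmaps2images(attn_maps)     # one heatmap per face token
for i, heat in enumerate(attn_hot):
    heat.save(f"attn_token_{i}.png")      # assuming PIL images, as in the notebook
```
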
juancopi81 commented 6 months ago

That's great, @YZBPXX, thanks. A final question: in get_net_attn_map, what does batch_size mean here? Is it related to the positive/negative conditions or to the number of images?

YZBPXX commented 6 months ago

That's great, @YZBPXX, thanks. A final question: in get_net_attn_map, what does batch_size mean here? Is it related to the positive/negative conditions or to the number of images?

Yes. Currently I have only considered the case num_samples=1, batch_size=2 (generating one image requires two forward passes, because classifier-free guidance evaluates the negative and positive conditions together). Generating multiple images at once may throw an error; some modifications are needed, and I will update the code later to support num_samples > 1. If you have any more questions, I'd be happy to answer them.
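
For clarity, a small sketch of how that batch dimension splits (the [negative, positive] ordering is the usual diffusers convention, assumed here):

```python
import torch

# stand-in for one recorded map: [batch, heads, query_len, face_tokens]
attn_map = torch.rand(2, 8, 4096, 16)

# with num_samples=1 the batch of 2 holds the negative (uncond) and the
# positive (cond) pass of classifier-free guidance
uncond_map, cond_map = attn_map.chunk(2, dim=0)
cond_map = cond_map.squeeze(0).mean(dim=0)  # average over heads -> [4096, 16]
```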

juancopi81 commented 6 months ago

Thanks a lot @YZBPXX! I was looking at the shapes of the attn_maps and found:

```
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 64, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
```

Just to be sure: the last dimension (16) is the feature space the image is projected into, so for the attention maps I could average over it?

YZBPXX commented 6 months ago

Thanks a lot @YZBPXX! I was looking at the shapes of the attn_maps and found:

```
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 64, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 256, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 1024, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
self.attn_map.shape torch.Size([2, 8, 4096, 16])
```

Just to be sure: the last dimension (16) is the feature space the image is projected into, so for the attention maps I could average over it?

The last dimension (16) is the number of face tokens, not a feature dimension. If you average over it, you should get the combined result of the 16 tokens. I have tried this, but the resulting image is just noise, without any meaningful information.
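
For reference, a minimal sketch of reshaping one recorded map into per-token images and averaging the 16 tokens (the shapes follow the dump above; the 64x64 spatial size assumes 4096 = 64 * 64):

```python
import torch

attn = torch.rand(2, 8, 4096, 16)   # [batch, heads, H*W, face_tokens]
cond = attn[1]                      # conditional half of the CFG batch
per_token = cond.mean(dim=0)        # average over heads -> [4096, 16]
maps = per_token.permute(1, 0).reshape(16, 64, 64)  # one 64x64 map per face token
combined = maps.mean(dim=0)         # average of all 16 token maps (tends to look like noise)
```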