tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with an image prompt.
Apache License 2.0

Details about resampler/Q-former #305

Closed AlanZhang1995 closed 4 months ago

AlanZhang1995 commented 4 months ago

Hello, thank you for sharing this impressive work as open source!

In your wiki section, I came across the statement "we use a Q-Former (16 tokens) to extract face features from CLIP image embeddings." Does the Q-former correspond to the resampler structure here? If so, it appears to be a modified Q-former without self-attention. Consequently, there may be no interaction among queries, correct?

I'm also curious about PerceiverAttention (line 63), where the keys and values seem to be computed from both x and the latents. I'd like to know the reasoning behind designing the cross-attention this way.

Thank you in advance for your insights!

xiaohu2015 commented 4 months ago

The keys and values of the attention include the latents' own features (https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/resampler.py#L63), hence there is interaction among the queries. You can think of it as merging a self-attention and a cross-attention into a single attention.

We follow the open_flamingo code, but in fact other models, e.g. https://github.com/openai/glide-text2im, also use this design.
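To illustrate the point about merged self- and cross-attention, here is a minimal single-head sketch of the idea (not the repository's actual `PerceiverAttention`, which adds LayerNorms and multi-head reshaping): because the keys/values are built from the concatenation of the image features `x` and the learnable `latents`, each latent query can attend to the other latents as well as to the image tokens.

```python
import torch
import torch.nn as nn

class PerceiverAttentionSketch(nn.Module):
    """Simplified single-head sketch of the merged self+cross attention
    used in resampler-style modules (hypothetical, for illustration only)."""

    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_kv = nn.Linear(dim, dim * 2, bias=False)
        self.to_out = nn.Linear(dim, dim, bias=False)

    def forward(self, x, latents):
        # Queries come from the learnable latents only.
        q = self.to_q(latents)
        # Keys/values come from BOTH the image features and the latents,
        # so latents also attend to each other: a self-attention and a
        # cross-attention merged into one attention operation.
        kv_input = torch.cat([x, latents], dim=-2)
        k, v = self.to_kv(kv_input).chunk(2, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        out = attn.softmax(dim=-1) @ v
        return self.to_out(out)

# Example: 257 CLIP image tokens, 16 learnable query tokens, dim 64.
module = PerceiverAttentionSketch(dim=64)
out = module(torch.randn(1, 257, 64), torch.randn(1, 16, 64))
# Output has one vector per latent query: shape (1, 16, 64).
```

With separate self- and cross-attention blocks you would need two attention passes; concatenating the latents into the key/value sequence achieves both interactions in one pass.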

AlanZhang1995 commented 4 months ago

Excellent explanation! Thank you for your response!