Closed AlanZhang1995 closed 4 months ago
The keys and values of the attention include the latent (self) features themselves, see https://github.com/tencent-ailab/IP-Adapter/blob/main/ip_adapter/resampler.py#L63, hence there is interaction among the queries. You can think of it as merging a self-attention and a cross-attention into a single attention operation.
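To make the merged self/cross-attention concrete, here is a minimal single-head NumPy sketch. It omits the learned projections, layer norms, and multi-head splitting of the real `PerceiverAttention` module; the only point it illustrates is that the keys/values are built from the concatenation of the image features `x` and the latent query tokens, so queries attend to each other as well as to `x`.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_attention(x, latents):
    """Simplified single-head sketch (no learned projections).

    x:       (n, d) image features (cross-attention context)
    latents: (m, d) learnable query tokens

    Keys/values come from BOTH x and the latents, so each query
    attends to the other queries too -- self-attention and
    cross-attention folded into one attention step.
    """
    kv_input = np.concatenate([x, latents], axis=0)   # (n + m, d)
    q, k, v = latents, kv_input, kv_input             # identity "projections" for the sketch
    scale = 1.0 / np.sqrt(q.shape[-1])
    attn = softmax(q @ k.T * scale, axis=-1)          # (m, n + m): each query sees x AND all latents
    return attn @ v                                    # (m, d)

out = perceiver_attention(np.random.randn(257, 8), np.random.randn(16, 8))
```

Note that the output keeps the shape of the latents (here 16 tokens), while the attention map spans both the image features and the latents, which is exactly why no separate self-attention layer is needed.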
We follow the open_flamingo code, but in fact other models, e.g. https://github.com/openai/glide-text2im, also use this design.
Excellent explanation! Thank you for your response!
Hello, thank you for sharing this impressive work as open source!
In your wiki section, I came across the statement "we use a Q-Former (16 tokens) to extract face features from CLIP image embeddings." Does the Q-Former correspond to the resampler structure here? If so, it appears to be a modified Q-Former without self-attention. Consequently, there would be no interaction among the queries, correct?
I'm also curious about PerceiverAttention (line 63), where the key and value seem to take both x and the latents as input. I'd like to know the reasoning behind designing the cross-attention this way.
Thank you in advance for your insights!