tencent-ailab / IP-Adapter

The image prompt adapter is designed to enable a pretrained text-to-image diffusion model to generate images with image prompt.
Apache License 2.0

What is the role of num_tokens? #147

Open TimeLessLing opened 8 months ago

TimeLessLing commented 8 months ago

Thank you very much for your great work! I ran into a question while reading the source code: what is the role of num_tokens?

I found the num_tokens parameter in the source code of IPAttnProcessor in attention_processor.py.

The only place num_tokens is used is in forward, where it splits ip_hidden_states off from the original hidden states for the new attention computation, corresponding to formulas (4) and (5) in the paper.

But if I understand correctly, in formula (5) of the paper the new IP attention should be computed over image features, whereas in the code the last num_tokens entries are simply sliced off the mixed hidden states. How is it guaranteed that the last num_tokens part of the hidden states corresponds only to image features?

Thank you very much.

xiaohu2015 commented 8 months ago

Hi, the mixed hidden states are text features + image features. The text features have 77 tokens, while the image features have 4 tokens (IP-Adapter-Plus has 16 tokens). Hence we can split the text features and image features using num_tokens:

    end_pos = encoder_hidden_states.shape[1] - self.num_tokens
    encoder_hidden_states, ip_hidden_states = (
        encoder_hidden_states[:, :end_pos, :],
        encoder_hidden_states[:, end_pos:, :],
    )