Open TimeLessLing opened 1 year ago
Thank you very much for your great work! I ran into a question while reading the source code: what is the role of `num_tokens`?

I found the `num_tokens` parameter in `IPAttnProcessor` in attention_processor.py. The only place it is used is in the forward pass, where it splits `ip_hidden_states` off from the original hidden states for the new attention computation, corresponding to formulas (4) and (5) in the paper. If I understand correctly, the new IP attention in formula (5) should be computed on image features, but in the code the last `num_tokens` entries are simply cut off from the mixed hidden states. How is it guaranteed that the last `num_tokens` part of the hidden states corresponds only to image features?

Thank you very much.
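P.S. For reference, my reading of the decoupled cross-attention in formulas (4) and (5) is roughly the following (paraphrased from memory, so the notation may differ slightly from the paper):

$$
Z^{\text{new}} = \mathrm{Attention}(Q, K, V) + \mathrm{Attention}(Q, K', V'),
\qquad K' = c_i W'_K,\quad V' = c_i W'_V,
$$

where $c_i$ are the image features, so only the image tokens should feed the second attention term.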
Hi, the mixed hidden states are the text features concatenated with the image features: the text features have 77 tokens, while the image features have 4 tokens (16 tokens for IP-Adapter Plus). Because the image tokens are always appended after the text tokens, we can split the text features and image features using `num_tokens`:

```python
end_pos = encoder_hidden_states.shape[1] - self.num_tokens
encoder_hidden_states, ip_hidden_states = (
    encoder_hidden_states[:, :end_pos, :],
    encoder_hidden_states[:, end_pos:, :],
)
```
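To make that invariant concrete, here is a minimal standalone sketch (the tensor names and shapes are made up for illustration; this is not the repository's actual pipeline code) showing that appending the image tokens after the text tokens lets the slice above recover exactly the two parts:

```python
import torch

# Hypothetical shapes: batch 2, 77 text tokens, 4 image tokens
# (as in the base IP-Adapter), hidden dim 768.
text_embeds = torch.randn(2, 77, 768)   # stand-in for the text encoder output
image_embeds = torch.randn(2, 4, 768)   # stand-in for the projected image tokens
num_tokens = image_embeds.shape[1]

# The pipeline appends the image tokens AFTER the text tokens, so the
# last num_tokens positions are image features by construction.
encoder_hidden_states = torch.cat([text_embeds, image_embeds], dim=1)  # (2, 81, 768)

# The split inside IPAttnProcessor then recovers both parts exactly.
end_pos = encoder_hidden_states.shape[1] - num_tokens
text_part = encoder_hidden_states[:, :end_pos, :]
ip_part = encoder_hidden_states[:, end_pos:, :]

assert torch.equal(text_part, text_embeds)
assert torch.equal(ip_part, image_embeds)
```

Since the concatenation is done once, before the UNet forward pass, every cross-attention layer sees the same token layout and the slice is always valid.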