+1 to this. When I use the BLIP model's extract_features function, the sequence dimension of the returned text embeddings varies across batches (sometimes (B, 19, 768), sometimes (B, 21, 768)).
I call it as:
features_multimodal_txt = self.model.extract_features(sample_copy, mode="text").text_embeds
Shouldn't they all be padded to the same max length?
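One possible workaround if you need a fixed sequence length downstream: pad (or truncate) the returned embeddings yourself. This is just a sketch, assuming `model` and `sample` are set up as in the call above; `pad_text_embeds` is a hypothetical helper, and right-side zero-padding is one choice among several:

```python
import torch.nn.functional as F

def pad_text_embeds(text_embeds, max_len=32):
    """Pad or truncate (B, L, D) text embeddings along the sequence
    dimension L so every batch comes out as (B, max_len, D)."""
    B, L, D = text_embeds.shape
    if L >= max_len:
        return text_embeds[:, :max_len, :]
    # F.pad's tuple is (D_left, D_right, L_left, L_right) for a 3-D tensor;
    # this zero-pads only the right side of the sequence dimension.
    return F.pad(text_embeds, (0, 0, 0, max_len - L))

features = model.extract_features(sample, mode="text")
padded = pad_text_embeds(features.text_embeds)  # (B, 32, 768)
```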
Hello all,
I am facing the same problem. Did you manage to find any workaround?
Thanks a lot ;)
Grab the first token returned; it corresponds to the [CLS] token, which is standard practice for BERT-style encoders. See this notebook, where they take the first token: https://github.com/salesforce/LAVIS/blob/main/examples/blip2_feature_extraction.ipynb
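Concretely, that looks something like the following minimal sketch, assuming `model` and `sample` are prepared as in the linked notebook:

```python
features = model.extract_features(sample, mode="text")

# text_embeds has shape (B, L, 768); index 0 along the sequence
# dimension is the [CLS] token, whose embedding summarizes the sentence.
cls_embedding = features.text_embeds[:, 0, :]  # shape (B, 768)
```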
Hi,
I notice that in blip2_qformer.py, in the `forward` function, the text_tokens are truncated to max_length, which is 32, while in the `extract_features` function, which to my understanding is an inference function, the text_tokens are not truncated and so can be much longer than during training in `forward`. May I ask why the difference? In particular, I do not understand why the text tokens are restricted to 32 during training.
Looking forward to the answer :) Thanks
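(For anyone hitting this mismatch: one possible workaround is to truncate the captions yourself before calling `extract_features`, so inference sees the same token budget as training. A rough sketch, assuming `model.tokenizer` is the BERT tokenizer LAVIS attaches to the Q-Former model; the decode/re-encode round trip is slightly lossy and is not the library's official API:)

```python
max_txt_len = 32  # the max_length used in blip2_qformer.py's forward()

# Truncate each caption to the training-time token budget, then decode
# back to strings so extract_features sees pre-truncated text.
tokens = model.tokenizer(
    sample["text_input"],
    truncation=True,
    max_length=max_txt_len,
)
sample["text_input"] = model.tokenizer.batch_decode(
    tokens["input_ids"], skip_special_tokens=True
)

features = model.extract_features(sample, mode="text")
```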