salesforce / LAVIS

LAVIS - A One-stop Library for Language-Vision Intelligence

Text tokenizer difference between forward and extract_features #492

Open · s7ev3n opened this issue 1 year ago

s7ev3n commented 1 year ago

Hi,

I noticed that in blip2_qformer.py, the forward function truncates text_tokens to a max_length of 32, while the extract_features function, which to my understanding is an inference function, does not truncate them at all. At inference time the text could therefore be much longer than anything seen during training via forward.

May I ask why there is this difference? In particular, I do not understand why the text tokens are restricted to 32 during training.

Looking forward to the answer :) Thanks
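For anyone comparing the two paths, the relevant call sites in blip2_qformer.py look roughly like this (paraphrased, so argument details may differ slightly across LAVIS versions):

```python
# In Blip2Qformer.forward (training): pad/truncate every caption to a
# fixed length, self.max_txt_len, which defaults to 32.
text_tokens = self.tokenizer(
    text,
    padding="max_length",
    truncation=True,
    max_length=self.max_txt_len,  # 32 by default
    return_tensors="pt",
)

# In extract_features (inference): pad only to the longest caption in
# the current batch, with no truncation at all.
text_tokens = self.tokenizer(
    caption,
    padding=True,  # "longest" -- length depends on the batch
    return_tensors="pt",
)
```

With HuggingFace tokenizers, `padding=True` means "pad to the longest sequence in this batch", which is why the output length varies from batch to batch, as the next comment observes.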

gunshi commented 1 year ago

+1 to this. When I use the BLIP model's feature-extraction functions, the returned text embeddings have a different sequence dimension across batches (sometimes (B, 19, 768), sometimes (B, 21, 768)). I call it as `features_multimodal_txt = self.model.extract_features(sample_copy, mode="text").text_embeds`.

Shouldn't they all be padded to the same max length?
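The varying sequence dimension is exactly the `padding=True` behavior noted above. A minimal, self-contained sketch with the HuggingFace tokenizer (BLIP-2's text side uses the bert-base-uncased vocabulary) reproduces it:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

batch = ["a dog", "a photo of a dog playing fetch in the park"]

# padding=True pads to the longest sequence *in this batch*, so the
# sequence length changes whenever the longest caption changes.
dynamic = tokenizer(batch, padding=True, return_tensors="pt")
print(dynamic["input_ids"].shape)  # e.g. torch.Size([2, 12])

# padding="max_length" with truncation gives a constant shape, which is
# what forward does during training.
fixed = tokenizer(
    batch,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
)
print(fixed["input_ids"].shape)  # torch.Size([2, 32])
```

So if you need a consistent shape across batches, either tokenize to a fixed length yourself before calling extract_features, or reduce over the sequence dimension, e.g. by taking the first token as suggested in the last comment below.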

billpsomas commented 9 months ago

Hello all,

I am facing the same problem. Did you manage to find any workaround?

Thanks a lot ;)

philkuz commented 7 months ago

Grab the first token returned; it corresponds to the [CLS] token, and using it as the sequence-level representation is standard practice for BERT-style transformer encoders. See this notebook, where they grab the first token: https://github.com/salesforce/LAVIS/blob/main/examples/blip2_feature_extraction.ipynb
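As a concrete sketch of that suggestion, following the linked notebook (the image path and caption here are placeholders):

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Model and type names as used in the linked feature-extraction notebook.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name="blip2_feature_extractor",
    model_type="pretrain",
    is_eval=True,
    device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image
caption = "a photo of a dog"  # placeholder caption

sample = {
    "image": vis_processors["eval"](raw_image).unsqueeze(0).to(device),
    "text_input": [txt_processors["eval"](caption)],
}

features_text = model.extract_features(sample, mode="text")

# Keep only the first ([CLS]) token: shape (B, 768), independent of how
# long the padded sequence happens to be for this batch. The projected
# low-dimensional features live in text_embeds_proj if you need those.
text_cls = features_text.text_embeds[:, 0, :]
```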