shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
MIT License

Questions about the IM_START and IM_END tokens #2

Open Haotian-Zhang opened 8 months ago

Haotian-Zhang commented 8 months ago

Thanks to the authors for the great work! I have a question regarding the following settings:

"image_start": START_INDEX_of_IMAGE_TOKENS, 
"image_end": END_INDEX_of_IMAGE_TOKENS, 

As these two tokens have been deprecated since the 1.3 release for LLaVA, could you please provide some instructions on how to specify these in the settings? Looking forward to the reply.

shikiw commented 8 months ago

Thanks for your appreciation, and sorry for the confusion!

These two settings do not refer to the special image start and end tokens (for example `<image_start>`). They are the position indexes of the first image token and the last image token in the input sequence. Here, "image tokens" means all tokens associated with the input image (if the model uses special image start and end tokens, include them as well). So `START_INDEX_of_IMAGE_TOKENS` is the index of the first image token, and `END_INDEX_of_IMAGE_TOKENS` is the index of the last one.

For example, suppose the prompt is tokenized into one leading non-image token at index 0, three image tokens at indexes 1 to 3, and then `Is` `_there` `_a` `_cat` `?` at indexes 4 to 8. The location indexes can then be specified as:

key_position = {
    "image_start": 1,
    "image_end": 3,
    "response_start": 9,
}

I hope this is helpful :)
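The indexing described above can be sketched in a few lines of Python. This is only an illustration, not code from the OPERA repository: the helper `find_key_position` and the placeholder strings `<bos>` and `<img>` are hypothetical names chosen for the example, and setting `response_start` to the prompt length is an assumption read off the example values (9 prompt tokens, `response_start` of 9).

```python
from typing import Dict, Sequence

# Hypothetical marker for an image token; real models use their own tokens or ids.
IMAGE_PLACEHOLDER = "<img>"


def find_key_position(
    tokens: Sequence[str], image_marker: str = IMAGE_PLACEHOLDER
) -> Dict[str, int]:
    """Return the first/last image token index and the index where the response starts."""
    image_indexes = [i for i, tok in enumerate(tokens) if tok == image_marker]
    if not image_indexes:
        raise ValueError("no image tokens found in the prompt")
    return {
        "image_start": image_indexes[0],   # index of the first image token
        "image_end": image_indexes[-1],    # index of the last image token
        "response_start": len(tokens),     # assumed: first position after the prompt
    }


# One leading token, three image tokens, then the question tokens.
prompt_tokens = ["<bos>", "<img>", "<img>", "<img>", "Is", "_there", "_a", "_cat", "?"]
key_position = find_key_position(prompt_tokens)
print(key_position)  # {'image_start': 1, 'image_end': 3, 'response_start': 9}
```

For a real model, the same idea would apply to the token sequence after the image features have been expanded into the input, so the indexes should be counted on that final sequence rather than on the raw text prompt.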