shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

What should key_position be on mPLUG-Owl2? #7

Closed BillChan226 closed 5 months ago

BillChan226 commented 7 months ago

I'm trying to apply OPERA to mPLUG-Owl2, but I'm stuck on deciding which values I should assign to the "image_start", "image_end", and "response_start" keys. I tried to copy the setup code from LLaVA-1.5, but it didn't work. I'm also having a hard time understanding why "NUM_IMAGE_TOKENS = 576" for LLaVA-1.5.

Could you kindly give me a hint on what these three values represent and how to set them for mPLUG-Owl2? Thank you!

shikiw commented 7 months ago

Hi,

Thanks for your appreciation! You may kindly check out this issue https://github.com/shikiw/OPERA/issues/2#issuecomment-1851351679 for an example of how to set the "image_start", "image_end", and "response_start" keys. Similarly, mPLUG-Owl2 also has image tokens obtained from the Visual Abstractor and text tokens obtained from the Text Embedding Layer, which together make up all of the input tokens for the Language Decoder. You may specify the token indexes before generation, as in https://github.com/shikiw/OPERA/blob/dba0dda9457a3234d22ef4b60ea38b74a02d3905/minigpt4/models/mini_gpt4.py#L375-L380
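For illustration, here is a minimal sketch of how those indexes could be computed for mPLUG-Owl2, assuming the decoder input is laid out as [text tokens before image] + [Visual Abstractor tokens] + [text tokens after image]. The variable names and shapes below are hypothetical stand-ins, not code from the repo; for the exact off-by-one convention, follow the MiniGPT-4 snippet linked above.

```python
import torch

# Hypothetical shapes standing in for the real mPLUG-Owl2 embeddings (assumed).
embs_before_image = torch.zeros(1, 35, 4096)   # text tokens preceding the image block
image_embs = torch.zeros(1, 64, 4096)          # Visual Abstractor output (64 query tokens, assumed)
embs_after_image = torch.zeros(1, 20, 4096)    # text tokens following the image block
inputs_embeds = torch.cat([embs_before_image, image_embs, embs_after_image], dim=1)

key_position = {
    "image_start": embs_before_image.shape[1],                      # index of the first image token
    "image_end": embs_before_image.shape[1] + image_embs.shape[1],  # one past the last image token
    "response_start": inputs_embeds.shape[1],                       # generated tokens begin after the prompt
}
print(key_position)  # {'image_start': 35, 'image_end': 99, 'response_start': 119}
```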

For LLaVA-1.5, "NUM_IMAGE_TOKENS = 576" because it adopts CLIP ViT-L/14 (at 336x336 input resolution) as the vision encoder: the image is split into a 24x24 grid of 14x14 patches, so the encoder outputs 576 image tokens as the visual representation.
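The arithmetic, spelled out:

```python
# LLaVA-1.5 vision encoder: CLIP ViT-L/14 at 336x336 input resolution.
image_size = 336
patch_size = 14
num_image_tokens = (image_size // patch_size) ** 2
print(num_image_tokens)  # 24 * 24 = 576
```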