Support for interleaved image-text comprehension (multi-image)

shikiw / OPERA

[CVPR 2024 Highlight] OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation

MIT License


Support for interleaved image-text comprehension (multi-image) #13

Closed: laserwave closed this issue 4 months ago

laserwave commented 5 months ago

Hi, congratulations on your great work. Does OPERA decoding support multi-image input?

For example:

Image1: <image>\nImage2: <image>\nWhat is the difference between image1 and image2?

If not, do you have any plans to support it?

shikiw commented 4 months ago

Hi, thanks for your appreciation!

The current implementation of OPERA supports multi-image input, but we haven't tested its performance yet. You can set <image_start> to the index of the first token of Image 1 and <image_end> to the index of the last token of Image 2, so that the span covers both images and any text between them. Please refer to https://github.com/shikiw/OPERA/issues/2 and https://github.com/shikiw/OPERA/issues/7 for how to set key_position.
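
For readers who want to try this, here is a minimal sketch of how the two indices could be computed for a prompt with several images. It is not taken from the OPERA code base. It assumes each `<image>` placeholder in the tokenized prompt is later expanded into a fixed number of visual tokens, that a sentinel id marks each placeholder, and that `key_position` is a dictionary with `image_start`, `image_end`, and `response_start` entries. The function name, the sentinel value, and the dictionary keys are illustrative assumptions, so check them against the linked issues before use.

```python
from typing import Dict, List

# Hypothetical sentinel id that marks an <image> placeholder in the prompt.
IMAGE_PLACEHOLDER = -200


def multi_image_key_position(
    input_ids: List[int],
    tokens_per_image: int,
    placeholder_id: int = IMAGE_PLACEHOLDER,
) -> Dict[str, int]:
    """Compute one span covering every image in the expanded sequence.

    Each placeholder is assumed to expand into `tokens_per_image` positions,
    so every earlier image shifts later indices by `tokens_per_image - 1`.
    """
    starts = []
    offset = 0  # extra positions introduced by images seen so far
    for idx, tok in enumerate(input_ids):
        if tok == placeholder_id:
            starts.append(idx + offset)
            offset += tokens_per_image - 1
    if not starts:
        raise ValueError("no image placeholder found in input_ids")

    expanded_len = len(input_ids) + offset
    return {
        # first token index of the first image
        "image_start": starts[0],
        # last token index (inclusive) of the last image
        "image_end": starts[-1] + tokens_per_image - 1,
        # assumed: the response begins right after the expanded prompt
        "response_start": expanded_len,
    }


# Toy prompt: two text tokens, image 1, two text tokens, image 2, two text tokens.
example_ids = [1, 10, IMAGE_PLACEHOLDER, 11, 12, IMAGE_PLACEHOLDER, 13, 14]
key_position = multi_image_key_position(example_ids, tokens_per_image=4)
print(key_position)
# {'image_start': 2, 'image_end': 11, 'response_start': 14}
```

Note that with this choice the span also includes the text tokens that sit between the two images ("Image2:" in the example prompt above). Whether that affects the quality of the penalty has not been tested, as mentioned in the reply.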