kohjingyu / fromage

🧀 Code and models for the ICML 2023 paper "Grounding Language Models to Images for Multimodal Inputs and Outputs".
https://jykoh.com/fromage
Apache License 2.0
466 stars, 34 forks

Failure in testing the demo #9

Closed by Yingjia-Wan 1 year ago

Yingjia-Wan commented 1 year ago

Hi! Thank you for the amazing work and for putting up the demo link!

However, I have run into some consistent problems when trying the demo (please see the screenshots below). The general pattern I found is that the FROMAGe chatbot can only respond effectively to my uploaded image prompt, and tends to perform the image-captioning task automatically regardless of my text input.

Three examples:

E.g., 1. No matter which text prompt I submit in the chat, the FROMAGe chatbot is indifferent to my questions (a bit funny).

image

E.g., 2. When I upload an image first, the chatbot successfully identifies the object in the image with language output. However, in the subsequent conversation, it keeps repeating the same task and output, ignoring my following text input.

FROMAGe-demo 2

E.g., 3. It is very easy to get no response or an infinite running time when entering text input.

image

=> I wonder what might be the cause of such problems from your perspective, and whether this is easy to fix. Thank you!

kohjingyu commented 1 year ago

Thanks for trying the demo. The model is a bit sensitive to prompts, so you might want to try the following things:

1) Increase the frequency multiplier on the right, to increase the frequency of outputting an image. Different conversations might require different multipliers.
2) To get around repetition of text, you can try increasing the sampling temperature (and top-p/top-k in the code). The underlying LLM is the OPT-6.7B model from Meta, and like many other LLMs, it has this failure mode.
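For readers unfamiliar with these sampling settings, here is a rough, generic sketch of what temperature, top-k, and top-p do when picking the next token. It is not the FROMAGe demo code: `sample_token` and its parameter names are hypothetical and chosen only for illustration, and the real demo may apply these settings differently.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Sample one token index from raw logits (illustrative helper only).

    temperature > 1 flattens the distribution, which tends to reduce repetition;
    top_k keeps only the k most likely tokens (0 disables it);
    top_p keeps the smallest set of tokens whose probability mass reaches top_p.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Rank token indices from most to least likely, then apply top-k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Nucleus (top-p) filtering: stop once the cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the weights of the surviving tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]
```

With a low temperature or a small `top_k`, the most likely token wins almost every time, which is where repeated outputs come from. Raising the temperature spreads probability onto other tokens, at the cost of less predictable text.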

Usually I find these are enough to get some decent examples. For some other cases, like with the sofa/couch example, it doesn't require as much tuning. I haven't tried with the kind of scenic pictures you are using, so perhaps it is not as sensitive.

Hope that helps!