njucckevin / SeeClick

The model, data and code for the visual GUI Agent SeeClick
Apache License 2.0

Prompts for other models #45

Open likaixin2000 opened 2 months ago

likaixin2000 commented 2 months ago

Hi, I am trying to compare models using ScreenSpot. What were the prompts you used for QwenVL, Fuyu, and CogAgent?

njucckevin commented 1 month ago

Hi. For CogAgent, we randomly chose three of their official prompts:

prompts = ["What steps do I need to take to \"{}\"?(with grounding)", "Can you advise me on how to \"{}\"?(with grounding)", "I'm looking for guidance on how to \"{}\".(with grounding)"]

For Fuyu, we chose the prompt based on discussions with the authors, as in https://huggingface.co/adept/fuyu-8b/discussions/42. It was probably:

"When presented with a box, perform OCR to extract text contained within it. If provided with text, generate the corresponding bounding box.\n{}"

For Qwen-VL, we followed the official example in their GitHub repo, probably:

"Generate the bounding box of {}"
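For anyone reproducing the comparison, the templates above can be filled in roughly as follows. This is only a sketch: the `build_prompt` helper and the model keys are hypothetical names, not part of the SeeClick code, and drawing one CogAgent template at random per instruction is an assumption about how the three prompts were used. The Fuyu and Qwen-VL strings are the "probably" versions quoted above, so they may not match the evaluation exactly.

```python
import random

# Templates copied from the reply above; "{}" is the ScreenSpot instruction.
COGAGENT_PROMPTS = [
    "What steps do I need to take to \"{}\"?(with grounding)",
    "Can you advise me on how to \"{}\"?(with grounding)",
    "I'm looking for guidance on how to \"{}\".(with grounding)",
]
FUYU_PROMPT = (
    "When presented with a box, perform OCR to extract text contained within it. "
    "If provided with text, generate the corresponding bounding box.\n{}"
)
QWEN_VL_PROMPT = "Generate the bounding box of {}"


def build_prompt(model, instruction, rng=None):
    """Hypothetical helper: fill the template for `model` with `instruction`."""
    rng = rng or random
    if model == "cogagent":
        # Assumption: one of the three templates is drawn per instruction.
        return rng.choice(COGAGENT_PROMPTS).format(instruction)
    if model == "fuyu":
        return FUYU_PROMPT.format(instruction)
    if model == "qwen-vl":
        return QWEN_VL_PROMPT.format(instruction)
    raise ValueError(f"unknown model: {model}")


if __name__ == "__main__":
    for name in ("cogagent", "fuyu", "qwen-vl"):
        print(name, "->", repr(build_prompt(name, "close the tab")))
```

Passing a seeded `random.Random` as `rng` keeps the CogAgent template choice reproducible across runs.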