Olafyii opened 2 months ago
I'm not sure that the multimodal value function helps much on WebArena (which is primarily a text-based benchmark). For example, this paper reports that a multimodal GPT-4o gets 24% on WebArena, while the equivalent text-only GPT-4o gets 23.5%.
Thanks for sharing the interesting work! Yes, I noticed this too, and I tried to switch the value function to a Llama 3 model, but it ultimately failed, so I'm digging through the code to figure out how to fix it. (I'd appreciate any tips from the authors.) So I went back and re-read the paper, which mentions: "We generate captions for each image on the webpage using an off-the-shelf captioning model (in our case, BLIP-2; Li et al. 2023)." Therefore, I think that to make this work, we need to figure out how to add the image captions to the prompt. If you find a solution, I'd appreciate seeing the code.
```python
if args.value_function in ["gpt4o"]:
    # score = value_function.evaluate_success(
    #     screenshots=last_screenshots[-(args.max_depth+1):] + [obs_img], actions=temp_action_history,
    #     current_url=env.page.url, last_reasoning=a["raw_prediction"],
    #     intent=intent, models=["gpt-4o-2024-05-13"],
    #     intent_images=images if len(images) > 0 else None)
    score = value_function.evaluate_success(
        screenshots=last_screenshots[-(args.max_depth+1):] + [obs_img], actions=temp_action_history,
        current_url=env.page.url, last_reasoning=a["raw_prediction"],
        intent=intent, models=[args.model],
        intent_images=images if len(images) > 0 else None)
```
> I'm not sure that the multimodal value function helps much on WebArena (which is primarily a text-based benchmark). For example, this paper reports that a multimodal GPT-4o gets 24% on WebArena, while the equivalent text-only GPT-4o gets 23.5%.
So would simply deleting the code that appends the screenshots captured by the agent replicate the experiment, or at least make it work with Llama 3? Like this code in search-agents/agent/value_function.py:
```python
content.extend([
    {
        "type": "image_url",
        "image_url": {
            "url": pil_to_b64(img)
        },
    }
])
```
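If the goal is a text-only value function, one option is to gate the image attachments behind a flag so text-only models such as Llama 3 never receive `image_url` entries. This is only a sketch, not the repo's actual code; the helper name `build_content` is hypothetical:

```python
def build_content(text_parts, screenshots, multimodal, pil_to_b64=None):
    """Build an OpenAI-style message content list.

    When `multimodal` is False, screenshots are skipped entirely so the
    payload stays valid for text-only models (e.g. Llama 3).
    `pil_to_b64` converts a PIL image to a base64 data URL, as in the repo.
    """
    content = [{"type": "text", "text": t} for t in text_parts]
    if multimodal:
        for img in screenshots:
            content.append({
                "type": "image_url",
                "image_url": {"url": pil_to_b64(img)},
            })
    return content
```

Calling `build_content(["Intent: ..."], screenshots, multimodal=False)` then yields a pure-text content list, which should be safe to send to a text-only endpoint.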
Thanks for sharing the interesting work!
I notice that the script to run the text WebArena evaluation utilizes the value function here (https://github.com/kohjingyu/search-agents/blob/c14a5d1bf2bea07f7aaa761d460c9d44e953d95f/agent/value_function.py#L16), which is multimodal by default. I think this would provide an unintended edge over baseline models, which are prompt-only?
My first attempt uses this code:
```python
# Caption the input image, if provided. (It seems nothing is being captured here.)
if images is not None and len(images) > 0:
    if self.captioning_fn is not None:
        image_input_caption = ""
        for image_i, image in enumerate(images):
            if image_i == 0:
                image_input_caption += f'Input image {image_i+1}: "{self.captioning_fn([image])[0]}"'
            else:
                image_input_caption += f'input image {image_i+1}: "{self.captioning_fn([image])[0]}"'
            if len(images) > 1:
                image_input_caption += ", "
        # Update intent to include captions of input images.
        intent = f"{image_input_caption}\nIntent: {intent}"
    elif not self.multimodal_inputs:
        print(
            "WARNING: Input image provided but no image captioner available."
        )
```
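For reference, that caption-augmentation logic can be factored into a standalone helper that works with any `captioning_fn` (e.g. a BLIP-2 wrapper). This is a sketch; the helper name `caption_intent` is hypothetical, and the captions are comma-joined here rather than carrying the trailing separator the original loop emits:

```python
def caption_intent(intent, images, captioning_fn):
    """Prepend captions of input images to the intent string.

    `captioning_fn` takes a list of images and returns a list of caption
    strings. Mirrors the snippet above, minus the trailing ", ".
    """
    if not images:
        return intent
    captions = [
        f'{"Input" if i == 0 else "input"} image {i + 1}: "{captioning_fn([img])[0]}"'
        for i, img in enumerate(images)
    ]
    return f'{", ".join(captions)}\nIntent: {intent}'
```

With this in place, a text-only value function could receive `caption_intent(intent, images, captioning_fn)` instead of the raw images.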
Sorry, just to make sure I understand: are you trying to create a text-only value function (with captions)?
Yes, but in the end I don't think it will work, so I've decided to use another open-source large model like InternVL 8B instead of GPT-4.
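One common way to plug in an open-source model is to serve it behind an OpenAI-compatible endpoint (e.g. with vLLM) and keep the payload text-only. A sketch of building such a request; the model name and system prompt are placeholders, not anything from the repo:

```python
def build_request(intent, captions_text, model="OpenGVLab/InternVL2-8B"):
    """Build a text-only chat-completions payload for an OpenAI-compatible
    server (e.g. vLLM). Model name and the endpoint you POST this to are
    assumptions, not the repo's configuration."""
    user_msg = f"{captions_text}\nIntent: {intent}" if captions_text else f"Intent: {intent}"
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You are evaluating whether a web agent completed its task."},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.0,
    }
```

Because the `messages` contain only strings (no `image_url` parts), the same payload should work for any text-only chat model.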