kohjingyu / search-agents

Code for the paper 🌳 Tree Search for Language Model Agents
https://jykoh.com/search-agents
MIT License
123 stars 13 forks

How many times is the value function evaluated in your "VisualWebArena benchmark" experiment? #5

Open 870572761 opened 1 month ago

870572761 commented 1 month ago

[image]

I found that if I just run the scripts for the "VisualWebArena benchmark" experiment, the task ends up failing many times. Did you set just one model in models? Did you have the model evaluate only once? (I think it would be better to average the model evaluations.)

kohjingyu commented 3 weeks ago

Sorry for the late reply. For the results in the paper, we only used gpt-4o as the value function.

> Maybe I think It would be better to average the model evaluations

Yes, I think an ensemble would probably work better, but for simplicity we stuck to a single model :)
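
For anyone who wants to try the averaging idea discussed above, here is a minimal sketch of what an ensemble value function could look like. It is not code from this repository: `ValueFn`, `ensemble_value`, `samples_per_fn`, and the three stand-in evaluators are hypothetical names chosen for illustration, and the sketch assumes each evaluator maps an observation string to a score in [0, 1]. In practice each evaluator would wrap a call to a different model, or repeated calls to the same model.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical type: a value function maps an observation to a score in [0, 1].
ValueFn = Callable[[str], float]


def ensemble_value(
    observation: str,
    value_fns: Sequence[ValueFn],
    samples_per_fn: int = 1,
) -> float:
    """Average the scores of several value functions on one observation.

    Each function is called `samples_per_fn` times, which also covers the
    case of sampling a single stochastic evaluator more than once.
    """
    if not value_fns:
        raise ValueError("at least one value function is required")
    if samples_per_fn < 1:
        raise ValueError("samples_per_fn must be at least 1")
    scores = [
        fn(observation)
        for fn in value_fns
        for _ in range(samples_per_fn)
    ]
    return mean(scores)


# Stand-in evaluators (assumptions for the demo, not real model calls).
def optimistic(_observation: str) -> float:
    return 1.0


def pessimistic(_observation: str) -> float:
    return 0.0


def neutral(_observation: str) -> float:
    return 0.5


if __name__ == "__main__":
    print(ensemble_value("some page state", [optimistic, pessimistic, neutral]))
```

Passing a single function with `samples_per_fn > 1` averages repeated evaluations of one model, while passing several functions averages across models. Either way, the cost is one extra evaluator call per added sample.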