Open 870572761 opened 2 months ago
Sorry for the late reply; for the results in the paper we only used gpt-4o as the value function.
I think it would be better to average the model evaluations
Yes, I think an ensemble would probably work better, but for simplicity we stuck to a single model :)
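For anyone who wants to try the averaging idea, here is a minimal sketch of what an ensembled value function could look like. The function names and the scoring interface are hypothetical (not from the repo); it just assumes each scorer maps a state to a numeric value, as a single gpt-4o call does in the paper's setup.

```python
from statistics import mean

def ensemble_value(score_fns, state, samples_per_fn=1):
    """Average value estimates across several scoring functions.

    score_fns: callables mapping a state to a float score. In practice each
    could wrap a different LLM prompt or model (hypothetical interface).
    samples_per_fn: query each scorer multiple times to also average out
    sampling noise from a single model.
    """
    scores = [fn(state) for fn in score_fns for _ in range(samples_per_fn)]
    return mean(scores)

# Toy usage with stub scorers standing in for real model calls:
stub_a = lambda s: 0.5
stub_b = lambda s: 1.0
print(ensemble_value([stub_a, stub_b], state="dummy"))  # 0.75
```

This would also let you average repeated evaluations of a single model (set `samples_per_fn > 1` with one scorer), which may already reduce the failure variance you are seeing.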
I found that when I just run the scripts for the "VisualWebArena benchmark" experiment, many tasks end up failing. Did you set just one model in models? Did you run the model evaluation only once? (I think it would be better to average the model evaluations)