Open 870572761 opened 2 months ago
Sorry for the late reply; for the results in the paper we only used gpt-4o as the value function.
I think it would be better to average the model evaluations
Yes, I think an ensemble would probably work better, but for simplicity we stuck to a single model :)
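For anyone who wants to try the averaging idea, here is a minimal sketch of what an ensembled value function could look like. The function names and the scoring interface are hypothetical (not from the repo); it just assumes each scorer maps a state to a numeric value, as a single gpt-4o call does in the paper's setup.

```python
from statistics import mean

def ensemble_value(score_fns, state, samples_per_fn=1):
    """Average value estimates across several scoring functions.

    score_fns: callables mapping a state to a float score. In practice each
    could wrap a different LLM prompt or model (hypothetical interface).
    samples_per_fn: query each scorer multiple times to also average out
    sampling noise from a single model.
    """
    scores = [fn(state) for fn in score_fns for _ in range(samples_per_fn)]
    return mean(scores)

# Toy usage with stub scorers standing in for real model calls:
stub_a = lambda s: 0.5
stub_b = lambda s: 1.0
print(ensemble_value([stub_a, stub_b], state="dummy"))  # 0.75
```

This would also let you average repeated evaluations of a single model (set `samples_per_fn > 1` with one scorer), which may already reduce the failure variance you are seeing.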
I found that when I just run the scripts for the "VisualWebArena benchmark" experiment, many tasks end up failing. Did you set just one model in models? Did you run the model evaluation only once? (I think it would be better to average the model evaluations)