Open ai-nikolai opened 10 months ago
Hi @ai-nikolai, what model are you using?
Thanks. The model used: gpt-3.5-turbo
@noahshinn
@noahshinn would it also be possible to upload the actual game logs for alfworld?
The model gpt-3.5-turbo is not the same model used during the paper's time (Feb 2023). We used text-davinci-002. I'd expect that the mistakes you see result from the inferred action not matching any of the actions in the action space. We followed ReAct's implementation for AlfWorld results to stay consistent with their work.
To help with this, I would advise you to display the action space to the model to eliminate parsing errors. I can add a side implementation for this if it would be helpful for you. Also, I will dig to see if I can find the original log files from the text-davinci-002 runs.
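A minimal sketch of what displaying the action space could look like, assuming the environment exposes a list of admissible commands at each step. The function names `format_action_space` and `is_admissible` are hypothetical and are not taken from the Reflexion or ReAct code:

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.strip().lower().split())


def format_action_space(admissible_commands):
    """Render the admissible commands as a numbered list to append to the prompt."""
    lines = [f"{i}. {cmd}" for i, cmd in enumerate(admissible_commands, start=1)]
    return "Valid actions:\n" + "\n".join(lines)


def is_admissible(action, admissible_commands):
    """Return True if the model's action matches one of the admissible commands."""
    allowed = {_normalize(cmd) for cmd in admissible_commands}
    return _normalize(action) in allowed


# Hypothetical example commands, for illustration only.
commands = ["go to desk 1", "take mug 1 from desk 1"]
print(format_action_space(commands))
print(is_admissible("  Go to  desk 1 ", commands))
```

Checking the action before stepping the environment makes it possible to count how many failures come from parsing rather than from the model's plan.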
Thank you @noahshinn.
Please let us know if you had any luck finding the original logs using text-davinci-002. This would be a really big help. Thank you.
I had the same issue with gpt-3.5-turbo. The success rate seems much lower. The first-trial success rate for me on a subset of tasks is only around 17%, which is consistent with the report from the AgentBench paper. So it would be really helpful if you could provide the original logs.
Hi all,
A couple of comments to follow up on this:
text-davinci-002 is deprecated. The two alternatives, davinci-002 and gpt-3.5-turbo, both have an accuracy of 0.3 on a subset, while your reported results have 0.7. Could you provide the traces, or tell us how we could reproduce your results?
text-davinci-002 (which is the model your code shows) only achieves 16%, which is in line with our reproducibility experiments.
info["won"]==True, while you use done==True. This is referenced in the original alfworld repository as an issue: https://github.com/alfworld/alfworld/issues/51
Concrete Actions / Questions:
@noahshinn @ysymyth @becklabs
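To illustrate why the choice between done==True and info["won"]==True matters, here is a small sketch. It assumes that an episode can end (done is true) without the task being solved, for example when a step limit is hit; the helper `episode_succeeded` and the sample values are hypothetical:

```python
def episode_succeeded(done, info):
    """Count an episode as a success only if the environment reports a win.

    `done` only says that the episode ended; `info["won"]` (assumed here to be
    a boolean) says whether the goal was actually reached.
    """
    return bool(info.get("won", False))


# Hypothetical end-of-episode values, for illustration only.
timed_out = {"done": True, "info": {"won": False}}
solved = {"done": True, "info": {"won": True}}

for name, result in [("timed_out", timed_out), ("solved", solved)]:
    by_done = result["done"]
    by_won = episode_succeeded(result["done"], result["info"])
    print(name, "done-based:", by_done, "won-based:", by_won)
```

Under that assumption, a done-based check would count the timed-out episode as a success while a won-based check would not, which could inflate the reported success rate.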
@noahshinn - any updates on the above?
Hi @ai-nikolai, I am also trying to reproduce the results. The performance was bad in the beginning. After adding these lines to parse the action, the performance went back to normal:
@CSUN1997 @noahshinn @dong-river @ysymyth - It seems there are a couple of issues, which are summarised in the StateAct paper (https://arxiv.org/abs/2410.02810).
Specifically the issues are:
put <object> in <place>; however, the correct alfworld syntax is put <object> in/on <place>.
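One way to work around the syntax mismatch is to rewrite the action before sending it to the environment. This is only a sketch under the assumption that a plain string rewrite is enough; `normalize_put` is a hypothetical helper, not code from this repository or from the StateAct paper:

```python
import re

# Matches "put <object> in <place>" or "put <object> on <place>".
# Actions that already use "in/on" do not match and are left unchanged.
_PUT_PATTERN = re.compile(r"put (.+?) (?:in|on) (.+)")


def normalize_put(action):
    """Rewrite 'put X in Y' or 'put X on Y' to the 'put X in/on Y' form."""
    action = action.strip()
    match = _PUT_PATTERN.fullmatch(action)
    if match is None:
        return action
    obj, place = match.groups()
    return f"put {obj} in/on {place}"


print(normalize_put("put mug 1 in desk 1"))
print(normalize_put("go to desk 1"))
```

Applying the rewrite only to actions that start with put keeps every other command untouched.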
Hi,
Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on Alfworld.
For example, Env0 and Env1 are successful for you, but we always get failures on our end. (Other envs are successful, though, so it does work sometimes.)
@noahshinn