microsoft / SmartPlay

SmartPlay is a benchmark for Large Language Models (LLMs). It uses a variety of games to test important capabilities of LLMs acting as agents. SmartPlay is designed to be easy to use and to support future development of LLMs.
Creative Commons Attribution 4.0 International

Replicate the scores for Table 2 #29

Open finalily opened 1 month ago

finalily commented 1 month ago

Hello Authors of SmartPlay,

Thank you for providing this nice testbed. I am trying to replicate the scores for Table 2, following your environment settings from the git repo, e.g. for the RockPaperScissorBasic (RPS) game:

```
RockPaperScissorBasic:
  challenges:
    Error/Mistake Handling: 1
    Generalization: 2
    Instruction Following: 3
    Learning from Interactions: 3
    Long Text Understanding: 2
    Planning: 1
    Understanding the Odds: 3
    Reasoning: 1
    Spatial Reasoning: 1
  recorded settings:
    iter: 20
    steps: 50
    human score: 43
    min score: 0
```

I ran with GPT-4, but the scores I get for RPS and Hanoi are 0.70 and 0.30, which differ from Table 2:

| Model | RPS | Hanoi |
| --- | --- | --- |
| GPT-4-0613 | 0.91 | 0.83 |
| GPT-4-0314 | 0.98 | 0.90 |

Could you please share more details about the LLM inference parameters: temperature, top_p, and frequency_penalty?

Hopefully I can use them to replicate your scores from Table 2 of the paper.

Thank you.

Holmeswww commented 1 month ago

Hi, we used temperature = 0 and default settings otherwise. What GPT version are you using?
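In the current OpenAI Python client, that corresponds to something like the following. This is a minimal sketch rather than our exact inference code; the model name and prompt contents are placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4-0613",  # pin an exact snapshot for reproducibility
    messages=[
        {"role": "system", "content": "<game instruction manual goes here>"},
        {"role": "user", "content": "<current observation goes here>"},
    ],
    temperature=0,  # greedy-leaning decoding; top_p and penalties left at defaults
)
print(response.choices[0].message.content)
```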

finalily commented 1 month ago

Is that greedy decoding? May I also verify the following settings:

`iter: 20 steps: 50`

Does `iter: 20` mean running the same game 20 times? What is `steps` here for?

I think we are running on GPT-4 0315 preview.

Thank you!

Holmeswww commented 1 month ago

Hi. Please note that with the most recent turbo updates, the models may behave differently.

This should be a matter of prompt engineering to get them on par with previous results, for example chain-of-thought prompting plus telling the models to focus their reasoning on the instruction manuals.
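As one illustration (a hypothetical prompt tweak, not the exact prompt behind the paper's numbers), the system message could be extended along these lines:

```python
# Hypothetical prompt adjustment -- not the prompt used for the reported scores.
manual_text = "<the game's instruction manual goes here>"

SYSTEM_PROMPT = (
    "You are playing a game described by the instruction manual below. "
    "Before each move, re-read the manual, reason step by step about what "
    "it implies for the current state, and only then output your action.\n\n"
    "Instruction manual:\n" + manual_text
)
```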

`iter: 20` indeed means running the same game 20 times. And `steps` is basically the number of steps we run per game until termination.
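Roughly, the evaluation loop looks like the sketch below. The environment and agent here are dummy stand-ins rather than the actual SmartPlay API, and the last two lines show how the human/min scores recorded above would plausibly be used to normalize the raw score:

```python
import random
import statistics

ITERS = 20      # "iter: 20" -- run the same game 20 times
MAX_STEPS = 50  # "steps: 50" -- per-episode step budget before forced termination
HUMAN_SCORE, MIN_SCORE = 43, 0  # from the recorded RPS settings quoted above

MOVES = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

class DummyRPSEnv:
    """Stand-in for the RPS environment; not the real SmartPlay interface."""
    def reset(self):
        return "start"
    def step(self, action):
        opponent = random.choice(MOVES)
        reward = 1.0 if BEATS[action] == opponent else 0.0
        return opponent, reward, False  # observation, reward, done

class RandomAgent:
    """Stand-in for the LLM agent; a real run would query the model here."""
    def act(self, obs):
        return random.choice(MOVES)

def run_episode(env, agent, max_steps):
    """Roll out one episode, stopping at termination or the step budget."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

env, agent = DummyRPSEnv(), RandomAgent()
scores = [run_episode(env, agent, MAX_STEPS) for _ in range(ITERS)]
raw = statistics.mean(scores)
normalized = (raw - MIN_SCORE) / (HUMAN_SCORE - MIN_SCORE)
print(f"raw mean: {raw:.2f}, human-normalized: {normalized:.2f}")
```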