WAH-NL GPT-4 benchmark - Githubissues

jinsikbang commented 6 months ago

Hello,

I tried the benchmark on WAH-NL environment using GPT-4, but some error occurred.

[Model] model_name: "OpenAI/gpt-4"

[Code] python src/evaluate.py --config-name=config_wah, python src/evaluate.py --config-name=config_wah_headless

[Error] Traceback (most recent call last): File "/home/----/project/LLMTaskPlanning/src/wah/wah_evaluator.py", line 79, in evaluate_main result = self.evaluate_task(task_planner, env, nl_instruction, task_d, log_prompt=False) File "/home/----/project/LLMTaskPlanning/src/wah/wah_evaluator.py", line 99, in evaluate_task step, prompt = task_planner.plan_step_by_step(nl_instruction, prev_steps, prev_action_msg) File "/home/----/project/LLMTaskPlanning/src/task_planner.py", line 213, in plan_step_by_step scores = self.score(prompt, self.skill_set) File "/home/----/project/LLMTaskPlanning/src/task_planner.py", line 89, in score scores = out['score'] File "/home/----/anaconda3/envs/llm-embodied/lib/python3.8/site-packages/guidance/_program.py", line 470, in getitem return self._variables[key] KeyError: 'score' [2024-05-05 01:28:55,147][wah.wah_evaluator][INFO] - Error: KeyError('score') assert prompt.endswith("<|im_start|>assistant\n"), "When calling OpenAI chat models you must generate only directly inside the assistant role! The OpenAI API does not currently support partial assistant prompting." AssertionError: When calling OpenAI chat models you must generate only directly inside the assistant role! The OpenAI API does not currently support partial assistant prompting

I ran the code with the following settings, Could you check the wah evaluate code?

Choi-JaeWoo commented 5 months ago

Hello, Thank you for reporting the issue and providing detailed information. I have updated the src/wah/wah_evaluator.py file to ensure that the code works correctly with the WAH-NL dataset using GPT-4. Please pull the latest changes from the repository and try running the evaluation again.

Let us know if you encounter any further issues or have any other questions.

mcpaulgeorge commented 1 month ago

We encountered the same problem when reproducing the Llama2 series of models.

lbaa2022 / LLMTaskPlanning

WAH-NL GPT-4 benchmark #2