Status: Closed (jinsikbang closed this issue 4 months ago)
Hello, thank you for reporting the issue and providing detailed information. I have updated src/wah/wah_evaluator.py so that the evaluation works correctly with the WAH-NL dataset when using GPT-4. Please pull the latest changes from the repository and run the evaluation again.
Let us know if you encounter any further issues or have any other questions.
We encountered the same problem when reproducing the Llama2 series of models.
Hello,
I tried running the benchmark in the WAH-NL environment using GPT-4, but an error occurred.
[Model] model_name: "OpenAI/gpt-4"
[Code]
python src/evaluate.py --config-name=config_wah
python src/evaluate.py --config-name=config_wah_headless
[Error]
Traceback (most recent call last):
  File "/home/----/project/LLMTaskPlanning/src/wah/wah_evaluator.py", line 79, in evaluate_main
    result = self.evaluate_task(task_planner, env, nl_instruction, task_d, log_prompt=False)
  File "/home/----/project/LLMTaskPlanning/src/wah/wah_evaluator.py", line 99, in evaluate_task
    step, prompt = task_planner.plan_step_by_step(nl_instruction, prev_steps, prev_action_msg)
  File "/home/----/project/LLMTaskPlanning/src/task_planner.py", line 213, in plan_step_by_step
    scores = self.score(prompt, self.skill_set)
  File "/home/----/project/LLMTaskPlanning/src/task_planner.py", line 89, in score
    scores = out['score']
  File "/home/----/anaconda3/envs/llm-embodied/lib/python3.8/site-packages/guidance/_program.py", line 470, in __getitem__
    return self._variables[key]
KeyError: 'score'
[2024-05-05 01:28:55,147][wah.wah_evaluator][INFO] - Error: KeyError('score')

    assert prompt.endswith("<|im_start|>assistant\n"), "When calling OpenAI chat models you must generate only directly inside the assistant role! The OpenAI API does not currently support partial assistant prompting."
AssertionError: When calling OpenAI chat models you must generate only directly inside the assistant role! The OpenAI API does not currently support partial assistant prompting.
I ran the code with the settings listed above. Could you check the WAH evaluation code?
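For anyone hitting the same KeyError before pulling the fix, the traceback above suggests that the scoring call returned without a 'score' entry, and the plain `out['score']` lookup hides the reason. The sketch below shows one way to make that failure explicit. It is not the change made in src/wah/wah_evaluator.py. `extract_scores` and `MissingScoreError` are hypothetical names, and it assumes `out['score']` maps each skill string to a numeric score, which the thread does not confirm.

```python
from typing import Dict, Mapping, Sequence


class MissingScoreError(RuntimeError):
    """Raised when the scoring output has no usable 'score' entry (hypothetical helper)."""


def extract_scores(out: Mapping[str, object], skill_set: Sequence[str]) -> Dict[str, float]:
    """Return a {skill: score} dict, or raise an error that explains what is missing.

    Assumption: out['score'] is a mapping from skill string to a numeric score.
    """
    try:
        raw = out["score"]
    except KeyError as exc:
        # This is the failure seen in the traceback: no 'score' variable was produced.
        raise MissingScoreError(
            "Scoring output has no 'score' entry. The backend may be a chat-only "
            "model that cannot score text inside a partially written assistant turn."
        ) from exc

    # Check that every candidate skill actually received a score.
    missing = [skill for skill in skill_set if skill not in raw]
    if missing:
        raise MissingScoreError(f"No score returned for skills: {missing}")

    return {skill: float(raw[skill]) for skill in skill_set}


if __name__ == "__main__":
    # Stand-in for a successful scoring result.
    ok = {"score": {"walk to kitchen": -0.5, "grab apple": -1.2}}
    print(extract_scores(ok, ["walk to kitchen", "grab apple"]))

    # Stand-in for the failing case from the traceback.
    try:
        extract_scores({}, ["walk to kitchen"])
    except MissingScoreError as err:
        print(f"Scoring failed: {err}")
```

A guard like this only turns the bare KeyError into a readable message. It does not make a chat-only model support partial assistant prompting.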