tangqiaoyu / ToolAlpaca

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
Apache License 2.0

Errors in Evaluation #14

Open jianguoz opened 9 months ago

jianguoz commented 9 months ago

Hi @tangqiaoyu, thanks for releasing this great work! I am trying to evaluate your checkpoint with:

# get LLM outputs
python instance_generation/generation.py \
  -api ./data/eval_simulated.json \
  -out ./eval \
  -llm TangQiaoYu/ToolAlpaca-13B \
  --agent_prompt test_v2 \
  --use_cache

However, it raises the errors below (see the full error in the attached file):

instance_generation/generation.py", line 86, in <module>
    output = agent(inst)
             ^^^^^^^^^^^
  File "/export/home/miniconda3/envs/toolalpaca/lib/python3.11/site-packages/langchain/chains/base.py", line 116, in __call__
    raise e
  File "/export/home/miniconda3/envs/toolalpaca/lib/python3.11/site-packages/langchain/chains/base.py", line 113, in __call__
    outputs = self._call(inputs)
              ^^^^^^^^^^^^^^^^^^
  File "/export/home/miniconda3/envs/toolalpaca/lib/python3.11/site-packages/langchain/agents/agent.py", line 792, in _call
    next_step_output = self._take_next_step(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "toolalpaca/agent/custom_agent_executor.py", line 22, in _take_next_step
    output = self.agent.plan(intermediate_steps, **inputs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/export/home/miniconda3/envs/toolalpaca/lib/python3.11/site-packages/langchain/agents/agent.py", line 385, in plan
    return self.output_parser.parse(full_output)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "toolalpaca/agent/custom_parser.py", line 23, in parse
    raise ValueError(f"Could not parse LLM output: `{text}`")
ValueError: Could not parse LLM output: `

error.txt

I am using your specified langchain/transformers versions, along with sentencepiece==0.1.99 and torch==2.0.0+cu118.

Could you please provide some insights here? I appreciate your help!

tangqiaoyu commented 9 months ago

Sorry for the inconvenience, I suspect this might stem from a minor error in my README. Could you please try changing --agent_prompt test_v2 to --agent_prompt test_v1? This is because test_v1 was the prompt used during the training of this model.
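
For reference, the same command with that change would be:

# get LLM outputs
python instance_generation/generation.py \
  -api ./data/eval_simulated.json \
  -out ./eval \
  -llm TangQiaoYu/ToolAlpaca-13B \
  --agent_prompt test_v1 \
  --use_cache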

jianguoz commented 8 months ago

Hi @tangqiaoyu, thanks for your suggestions! I ran the code with test_v1. However, I still get the errors below:

  1. It still encounters the same ValueError: Could not parse LLM output: errors. See details in error_out.txt and error_out_2.txt.
  2. Sometimes it runs successfully and saves the generation files. However, several cases in the file have no Instances and raise errors when we run evaluation.py to evaluate the generations. Could you check this? I appreciate your help!
jianguoz commented 8 months ago

Hi @tangqiaoyu, just a kind follow-up in case you missed the new comment :). Thanks for your time!

tangqiaoyu commented 8 months ago

I apologize for the delayed response. ValueError: Could not parse LLM output: is a normal phenomenon, especially with models that haven't been trained on the dataset. As seen in your error_out_2.txt file, the sequence length exceeds the model's maximum length, so the model generates an ASSISTANT Thought without a final ASSISTANT Action, and the code cannot parse the output. To address this, I suggest uncommenting the code in generation.py at lines 90-92.
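
For context, here is a minimal sketch of the kind of fallback such a change enables around the agent call (the actual content of lines 90-92 in generation.py may differ; agent and inst are the names from the traceback above):

# Hypothetical illustration only; the real code in generation.py may differ.
try:
    output = agent(inst)
except ValueError as e:
    # When the model hits its context limit it can emit an ASSISTANT Thought
    # without a final ASSISTANT Action, so the output parser raises ValueError.
    # Record the failure and continue with the next instance.
    output = {"error": str(e)}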

Thank you for bringing this to our attention, and please let me know if there are any further issues or questions.

jianguoz commented 8 months ago

Hi @tangqiaoyu, thanks very much for your detailed reply. I can now run the evaluation successfully after uncommenting lines 90-92. When I evaluate TangQiaoYu/ToolAlpaca-13B, the most common issue is JSON formatting errors like those in log_simulation.txt; I guess a fine-tuned model, or even gpt-3.5-turbo, may have difficulty generating valid JSON. The final output statistics are below:

    "statistics": {
        "num": 100,
        "error_num": 10,
        "process": {
            "Yes": 55,
            "No": 35,
            "Uncertain": 0
        },
        "response": {
            "Yes": 60,
            "No": 30,
            "Uncertain": 0
        },
        "both": 53
    },

See the full model generation in eval_ToolAlpaca-13B.json.
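
For reference, a quick sketch of how I read these counts as rates (assuming each rate is simply the Yes count over num; the exact normalization used in the paper may differ):

# Rough rates implied by the statistics above (assumption: rate = Yes / num).
num = 100
process_rate = 55 / num   # 0.55
response_rate = 60 / num  # 0.60
both_rate = 53 / num      # 0.53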

I am a little bit confused about the definitions of each key. It seems that the scores for "process", "response", and "both" are much lower than the numbers reported in the paper. Could you provide some insights here?

Another question: I used the suggested test_v1 to evaluate your model. To make the model more robust or improve performance, should I use both test_v1 and test_v2 if I want to fine-tune the llama2 model on the training data?

Appreciate your time and help!

tangqiaoyu commented 8 months ago

Hi, @jianguoz, thanks for the further details. Here's my response to each of your questions:

  1. It's odd that you're encountering errors here. The JSON string handling is already taken care of in our code at this link, ensuring all simulator inputs are valid JSON (a rough sketch of this kind of normalization appears after this list). But it looks like the error you're facing is coming from the simulator.
  2. Generating JSON strings, even complex ones, is typically not a big problem for GPT-3.5-turbo. For instance, in the tool_maker section, we successfully used it to create detailed OpenAPI Specifications.
  3. There might be some variations in the evaluation outcomes. I'll update the model output details to provide more clarity and help you understand the specifics.
  4. I recommended test_v1 because our model was trained with those prompts. It's mainly a format issue, but training with both test_v1 and test_v2 should be fine if you're looking to fine-tune the llama2 model on your training data.
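
For illustration, here is a hedged sketch of the kind of JSON normalization mentioned in point 1 (the helper name and fallback behavior are hypothetical; the actual handling in the ToolAlpaca code may differ):

import json

def ensure_json(action_input: str) -> str:
    """Best-effort normalization of a model-produced Action Input into valid JSON.
    Hypothetical helper; the real ToolAlpaca code may handle this differently."""
    text = action_input.strip()
    # Strip code fences the model sometimes wraps around its JSON.
    if text.startswith("```"):
        text = text.strip("`")
        if text.lower().startswith("json"):
            text = text[len("json"):].strip()
    try:
        # Round-trip through json to produce a canonical, valid JSON string.
        return json.dumps(json.loads(text))
    except json.JSONDecodeError:
        # Fall back to wrapping the raw text so the simulator still receives valid JSON.
        return json.dumps({"raw_input": text})
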
jianguoz commented 7 months ago

@tangqiaoyu Thanks for the insightful comments! It seems that the reproduction produces lower results, and I am not sure whether the potential issues come from the simulator. Do you have any updates regarding the model output details? Thanks :)

tangqiaoyu commented 7 months ago

@jianguoz Apologies for the delayed response. I have uploaded the GPT-4 evaluation results. Could you please check the model outputs and evaluation results to find the potential issues?