open-compass / T-Eval

[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
https://open-compass.github.io/T-Eval/
Apache License 2.0
235 stars, 15 forks

Question about qwen-14b evaluation results #38

Open Fenglly opened 8 months ago

Fenglly commented 8 months ago

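Using a CustomAPI class that I implemented from the template provided by the authors, I evaluated the qwen-14b-chat model and got the following results:

| Instruct | Plan | Review | Reason | Retrieve | Understand | Overall |
|---|---|---|---|---|---|---|
| 97.0 | 78.0 | 41.9 | 60.0 | 86.6 | 61.8 | 70.9 |

The Retrieve and Understand scores differ considerably from those on the ZH Leaderboard.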
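As a quick sanity check on the numbers reported above, the sketch below tests whether the overall score is consistent with an unweighted mean of the six dimension scores. That aggregation rule is an assumption made here for illustration, not something confirmed in this thread, and the variable names are hypothetical.

```python
# Scores as reported in this issue for qwen-14b-chat (custom CustomAPI run).
scores = {
    "Instruct": 97.0,
    "Plan": 78.0,
    "Review": 41.9,
    "Reason": 60.0,
    "Retrieve": 86.6,
    "Understand": 61.8,
}
reported_overall = 70.9

# Assumption: "overall" is the unweighted mean of the six dimension scores.
mean_score = sum(scores.values()) / len(scores)

# Compare at the one-decimal precision used in the reported results.
matches = abs(round(mean_score, 1) - reported_overall) < 1e-9

print(f"mean = {mean_score:.2f}, reported overall = {reported_overall}, matches = {matches}")
```

The mean rounds to 70.9, so the overall figure is at least internally consistent with the six scores. Any gap against the ZH Leaderboard would therefore come from the individual Retrieve and Understand scores rather than from how they are aggregated.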