irthomasthomas / undecidability

Berkeley Function-Calling Leaderboard #772

Open irthomasthomas opened 8 months ago

irthomasthomas commented 8 months ago

Berkeley Function-Calling Leaderboard

Description

This live leaderboard evaluates how accurately LLMs call functions (also known as tools). It is built from real-world data and will be updated periodically. For more information on the evaluation dataset and methodology, please refer to our blog post and code release.
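To make "calling a function accurately" concrete, here is a minimal sketch of one way a model's emitted call could be compared against an expected call by parsing both into syntax trees. This is an illustration only, not the leaderboard's actual evaluator (the real methodology is described in the blog post and code release). The helper names `parse_call` and `call_matches` and the `get_weather` function are hypothetical, and the sketch assumes calls are written as Python-style keyword-argument expressions.

```python
import ast


def parse_call(source: str):
    """Parse a single Python-style call such as f(a=1) into (name, kwargs).

    Assumption: only keyword arguments with literal values are supported.
    """
    node = ast.parse(source.strip(), mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("expected a single function call")
    if node.args:
        raise ValueError("positional arguments are not handled in this sketch")
    name = ast.unparse(node.func)  # requires Python 3.9+
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs


def call_matches(model_output: str, expected: str) -> bool:
    """Return True when both strings parse to the same function name and arguments."""
    try:
        return parse_call(model_output) == parse_call(expected)
    except (SyntaxError, ValueError):
        # Unparseable or unsupported output counts as a miss.
        return False


# Hypothetical example: argument order does not matter, argument values do.
print(call_matches('get_weather(city="Paris", unit="C")',
                   'get_weather(unit="C", city="Paris")'))
print(call_matches('get_weather(city="Rome")',
                   'get_weather(city="Paris")'))
```

Comparing parsed structures rather than raw strings means differences in whitespace or argument order are ignored, while a wrong function name or argument value is still caught.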

Leaderboard

| Rank | Overall Acc | Model | Organization | License | AST Summary | Exec Summary | Relevance |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 84.28 | GPT-4-1106-Preview | OpenAI | Proprietary | 86.06 | 65.53 | 88.75 |
| 2 | 84.16 | GPT-4-0125-Preview | OpenAI | Proprietary | 85.61 | 67.24 | 87.50 |
| 3 | 84.16 | Gorilla-OpenFunctions-v2 | Gorilla LLM | Apache 2.0 | 84.33 | 72.72 | 71.67 |
| 4 | 83.67 | Claude-3-Opus-20240229 | Anthropic | Proprietary | 79.82 | 73.73 | 84.58 |
| 5 | 81.75 | Mistral-medium-2312 | Mistral AI | Proprietary | 78.67 | 66.93 | 90.00 |
| 6 | 80.30 | Claude-3-Sonnet-20240229 | Anthropic | Proprietary | 84.91 | 76.15 | 41.25 |
| 7 | 80.30 | GPT-3.5-Turbo-0125 | OpenAI | Proprietary | 81.55 | 69.43 | 68.33 |
| 8 | 79.07 | Functionary-Medium-v2.2 | MeetKai | N/A | 82.25 | 61.97 | 61.97 |
| 9 | 77.41 | Claude-2.1 | Anthropic | Proprietary | 76.53 | 53.93 | 78.33 |
| 10 | 61.75 | Mistral-tiny-2312 | Mistral AI | Proprietary | 55.28 | 53.42 | 77.08 |
| 11 | 61.02 | Claude-instant-1.2 | Anthropic | Proprietary | 57.06 | 49.88 | 61.67 |
| 12 | 56.87 | Mistral-small-2312 | Mistral AI | Proprietary | 57.01 | 36.18 | 89.58 |
| 13 | 56.81 | Mistral-large-2402 | Mistral AI | Proprietary | 40.58 | 38.49 | 84.58 |
| 14 | 55.90 | Nexusflow-Raven-v2 | Nexusflow | Apache 2.0 | 58.01 | 63.67 | 0.00 |
| 15 | 55.87 | Firefunction-v1 | Fireworks-ai | Apache 2.0 | 40.05 | 37.31 | 81.25 |
| 16 | 55.68 | Gemini-1.0-Pro | Google | Proprietary | 42.18 | 29.30 | 78.30 |
| 17 | 54.52 | GPT-4-0613 | OpenAI | Proprietary | 40.14 | 27.12 | 87.08 |
| 18 | 45.96 | Deepseek-v1.5 | Deepseek | Deepseek License | 48.59 | 8.55 | 66.25 |
| 19 | 44.40 | Gemma-7B-IT | Google | gemma-term-of-use | 48.61 | 40.43 | 0.42 |
| 20 | 33.37 | Gorilla-OpenFunctions-v0 | Gorilla LLM | Apache 2.0 | 29.88 | 24.06 | 4.58 |
| 21 | 24.58 | Glaive-v1 | Glaive | cc-by-sa-4.0 | 15.14 | 14.92 | 46.25 |
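The leaderboard is ordered by overall accuracy, but the per-category columns can rank models quite differently. The sketch below re-sorts a handful of rows copied from the leaderboard by a chosen column. The field names (`overall`, `ast`, `exec`, `relevance`) and the helper `rank_by` are made up for this illustration, and only five of the 21 rows are included.

```python
# A few rows copied from the leaderboard:
# (model, overall, ast, exec, relevance)
ROWS = [
    ("GPT-4-1106-Preview", 84.28, 86.06, 65.53, 88.75),
    ("Gorilla-OpenFunctions-v2", 84.16, 84.33, 72.72, 71.67),
    ("Claude-3-Opus-20240229", 83.67, 79.82, 73.73, 84.58),
    ("Mistral-medium-2312", 81.75, 78.67, 66.93, 90.00),
    ("Claude-3-Sonnet-20240229", 80.30, 84.91, 76.15, 41.25),
]

# Hypothetical column names mapped to tuple positions.
COLUMNS = {"overall": 1, "ast": 2, "exec": 3, "relevance": 4}


def rank_by(column: str) -> list[str]:
    """Return model names sorted from highest to lowest score in `column`."""
    index = COLUMNS[column]
    return [row[0] for row in sorted(ROWS, key=lambda r: r[index], reverse=True)]


for column in COLUMNS:
    print(f"{column:>9}: {rank_by(column)[0]}")
```

Among these five rows, the top model changes with the column: GPT-4-1106-Preview leads on overall accuracy and AST, Claude-3-Sonnet-20240229 on execution, and Mistral-medium-2312 on relevance.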

Source

Berkeley Function-Calling Leaderboard

irthomasthomas commented 8 months ago

Related content

331 similarity score: 0.89

625 similarity score: 0.89

456 similarity score: 0.89

358 similarity score: 0.88

366 similarity score: 0.88

725 similarity score: 0.88