Feature Tracker: Evaluation Benchmarks

rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'

Apache License 2.0

375 stars 12 forks source link

Title.

Benchmarks: Evaluation Methodologies

[x] G-Eval
[ ] QAG

Coding Ability

[ ] Aider Benchmark
[ ] CodeMMLU

Confabulation Rate

[ ] TruthfulQA
[ ] f

Context Length

[ ] Ruler
[ ] InfiniteBench
[ ] Babilong
[ ] LongICLBench
[ ] HelloBench
[ ] Snorkel Working Memory Test

Creative Writing

[ ] EQ Bench
[ ] f

Pop Culture

[ ] f

Reasoning

[ ] MMLU Pro
[ ] ARC public

Role Playing

Conversational Relevancy
Role Adherence
Knowledge Retention
Conversation Competeness

Summarization

[ ] DeepEval
[ ] Salesforce
[ ] F

Tool Calling

[ ] DeepEval
[ ] Berkeley Function Calling https://gorilla.cs.berkeley.edu/leaderboard.html
[ ] ToolSandbox https://arxiv.org/html/2408.04682v1

Toxicity Testing

[ ] DeepEval
[ ] f

Vibes

[ ] AidanBench
[ ] f

Using an LLM as a Response Judge Some metrics cannot be defined objectively and are particularly useful for more subjective or complex criteria. We care about correctness, faithfulness, and relevance. Answer Correctness - Is the generated answer correct compared to the reference and thoroughly answers the user's query? Answer Relevancy - Is the generated answer relevant and comprehensive? Answer Factfulness - Is the generated answer factually consistent with the context document?

rmusser01 / tldw

Feature Tracker: Evaluation Benchmarks #237