rmusser01 / tldw

tl/dw (Too Long, Didn't Watch): Your Personal Research Multi-Tool - a naive attempt at 'A Young Lady's Illustrated Primer'
Apache License 2.0
375 stars 12 forks source link

Feature Tracker: Evaluation Benchmarks #237

Open rmusser01 opened 2 months ago

rmusser01 commented 2 months ago

Title.

Benchmarks: Evaluation Methodologies

Coding Ability

Confabulation Rate

Context Length

Creative Writing

Pop Culture

Reasoning

Role Playing

Summarization

Tool Calling

Toxicity Testing

Vibes

rmusser01 commented 1 week ago
Using an LLM as a Response Judge

Some metrics cannot be defined objectively and are particularly useful for more subjective or complex criteria. We care about correctness, faithfulness, and relevance.

    Answer Correctness - Is the generated answer correct compared to the reference and thoroughly answers the user's query?
    Answer Relevancy - Is the generated answer relevant and comprehensive?
    Answer Factfulness - Is the generated answer factually consistent with the context document?