[ ] Useful benchmarks that have human scores beyond AI SOTA. - Google Docs

Useful benchmarks that have human scores beyond AI SOTA

Full Content

Useful benchmarks that have human scores beyond AI SOTA.

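On several widely used NLP benchmarks, reported human performance has exceeded the state of the art (SOTA) in AI, at least when the benchmark was released. Some of these gaps have since narrowed or closed, so each entry should be checked against current results:

SuperGLUE: A broad natural language understanding benchmark made up of eight tasks. Its published human baseline (89.8) has been exceeded by leaderboard submissions since 2021, so it no longer shows a human lead on the aggregate score.
QuALITY: A multiple-choice question answering dataset over long passages, where annotators who read the full text far outperformed the baseline models reported with the dataset.
BIG-bench: A diverse collection of more than 200 tasks that probe the capabilities of large language models, with many subtasks where human raters outperform the evaluated models.
HotpotQA: A multi-hop question answering task over Wikipedia paragraphs, where the human performance reported by the dataset authors exceeded the baseline models.
SWAG: A commonsense inference task posed as multiple-choice sentence completion. Human accuracy was about 88%, but BERT reached about 86% shortly after release, so the human lead was small.
HellaSwag: A successor to SWAG built with adversarially filtered, harder examples. At release, humans scored above 95% while the best models scored below 50%; large language models have since closed most of that gap.

These benchmarks suggest that gaps remain between current AI capabilities and human-level performance on some tasks, though the size of each gap changes as new models are evaluated. Closing these gaps remains an important area of research.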
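
Because the size of a human-versus-model gap is simple arithmetic on two reported scores, a small script can help keep a list like this one current. The following is a minimal sketch, not part of the original note: the `BenchmarkScore` type, the helper names, and every score in `EXAMPLE_SCORES` are hypothetical placeholders and should be replaced with figures taken from each benchmark's paper or leaderboard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkScore:
    name: str
    human: float       # reported human score, in percent
    best_model: float  # best reported model score, in percent


def human_lead(score: BenchmarkScore) -> float:
    """Return the human-minus-model gap in percentage points."""
    return round(score.human - score.best_model, 2)


def benchmarks_with_human_lead(scores):
    """Keep only benchmarks where humans still score higher, largest gap first."""
    leading = [s for s in scores if human_lead(s) > 0]
    return sorted(leading, key=human_lead, reverse=True)


# Placeholder values for illustration only; these are NOT real benchmark results.
EXAMPLE_SCORES = [
    BenchmarkScore("benchmark_a", human=90.0, best_model=92.5),
    BenchmarkScore("benchmark_b", human=94.0, best_model=71.0),
    BenchmarkScore("benchmark_c", human=88.0, best_model=86.5),
]

if __name__ == "__main__":
    for s in benchmarks_with_human_lead(EXAMPLE_SCORES):
        print(f"{s.name}: humans lead by {human_lead(s)} points")
```

With the placeholder values, only `benchmark_b` and `benchmark_c` are kept, because `benchmark_a` has a model score above the human score. Comparing scores this way only makes sense when the human and model numbers use the same metric and the same evaluation split.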

irthomasthomas / undecidability

Useful benchmarks that have human scores beyond AI SOTA. - Google Docs #954

Suggested labels

None

Related content

812 similarity score: 0.89

940 similarity score: 0.87

953 similarity score: 0.87

810 similarity score: 0.86

951 similarity score: 0.86

706 similarity score: 0.86