Useful benchmarks that have human scores beyond AI SOTA
On a number of widely used benchmarks, reported human performance surpasses the current state of the art (SOTA) in AI:
SuperGLUE: A broad natural language understanding benchmark with a published human baseline. Top leaderboard submissions have since exceeded that baseline, so the human advantage here no longer holds on the aggregate score.
QuALITY: A multiple-choice reading comprehension dataset built on long passages, where skilled annotators outperform the best AI systems.
BIG-bench: A diverse set of tasks that probe the capabilities of large language models, with many subtasks where humans outperform AI.
HotpotQA: A multi-hop question answering dataset that requires reasoning over several documents, where human performance exceeds the best AI models.
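SWAG: A commonsense inference task (choosing the most plausible continuation of a described situation). Human performance was well above the models available at release, although later pretrained models such as BERT reached near-human accuracy.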
HellaSwag: A follow-up to SWAG with adversarially filtered, more challenging examples, on which humans scored far above the models available at release.
These benchmarks suggest that there remain significant gaps between current AI capabilities and human-level performance on many real-world tasks. Closing these gaps will be an important area of research going forward.