open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0
4.01k stars 425 forks

[Feature] Support InfiniteBench #689

Closed Ezra-Yu closed 10 months ago

Ezra-Yu commented 10 months ago

InfiniteBench

OpenBMB has released a new long-context benchmark: https://github.com/OpenBMB/InfiniteBench

Some of its tasks, such as Retrieve.PassKey1, Retrieve.Number, and Retrieve.KV2, have been used in many recent papers.

| Task Name | Context | # Examples | Avg Input Tokens | Avg Output Tokens | Description |
| --- | --- | --- | --- | --- | --- |
| En.Sum | Fake Book | 103 | 171.5k | 1.1k | Summarization of a fake book created with core entity substitution. |
| En.QA | Fake Book | 351 | 192.6k | 4.8 | Free-form question answering based on the fake book. |
| En.MC | Fake Book | 229 | 184.4k | 5.3 | Multiple choice questions derived from the fake book. |
| En.Dia | Script | 200 | 103.6k | 3.4 | Identification of talkers in partially anonymized scripts. |
| Zh.QA | New Book | 175 | 2068.6k | 6.3 | Question answering on a set of newly collected books. |
| Code.Debug | Code Document | 394 | 114.7k | 4.8 | Finding which function in a code repo contains a crashing error (in multiple choice form). |
| Code.Run | Synthetic | 400 | 75.2k | 1.3 | Simulating execution of multiple simple, synthetic functions. |
| Math.Calc | Synthetic | 50 | 43.9k | 43.9k | Calculations involving super-long arithmetic equations. |
| Math.Find | Synthetic | 350 | 87.9k | 1.3 | Finding special integers in a lengthy list. |
| Retrieve.PassKey1 | Synthetic | 590 | 122.4k | 2.0 | Retrieving hidden keys in a noisy long context. |
| Retrieve.Number | Synthetic | 590 | 122.4k | 4.0 | Locating repeated hidden numbers in a noisy long context. |
| Retrieve.KV2 | Synthetic | 500 | 89.9k | 22.7 | Finding the corresponding value from a dictionary and a key. |
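
For illustration only, here is a minimal sketch of what a pass-key style retrieval sample could look like, based on the description "retrieving hidden keys in a noisy long context". The filler sentence, the prompt wording, the helper names `build_passkey_sample` and `score`, and the substring-match scoring rule are all assumptions made for this example; they are not taken from the InfiniteBench data files or from any OpenCompass API.

```python
import random


def build_passkey_sample(n_filler: int, seed: int = 0) -> tuple[str, str]:
    """Return (prompt, answer) with one pass key hidden among filler lines.

    Hypothetical helper: the real benchmark ships pre-built samples.
    """
    rng = random.Random(seed)
    passkey = str(rng.randint(10000, 99999))  # always a 5-digit key
    filler = "The grass is green. The sky is blue. The sun is yellow."
    needle = f"The pass key is {passkey}. Remember it."

    lines = [filler] * n_filler
    # Insert the needle at a random position among the filler lines.
    lines.insert(rng.randint(0, n_filler), needle)

    prompt = "\n".join(lines) + "\nWhat is the pass key?"
    return prompt, passkey


def score(prediction: str, answer: str) -> float:
    """Assumed scoring rule: 1.0 if the key appears in the prediction."""
    return 1.0 if answer in prediction else 0.0


prompt, answer = build_passkey_sample(n_filler=1000, seed=42)
```

Raising `n_filler` lengthens the context, which is the property that makes this kind of task useful for probing long-context models.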

Would you like to implement this feature yourself?

tonysy commented 10 months ago

#739