adampingel opened 2 months ago
JetBrains-Research has published the benchmark suite Long Code Arena:
The benchmarks are code-related tasks focused on measuring how well models can process large context windows. They differ from other popular benchmarks both in how large a context they allow and in how realistic they aim to be: the datasets are built from real-world repositories, and the tasks replicate real-world scenarios rather than synthetic, evaluation-focused use cases.
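For anyone who wants to poke at the data: the datasets appear to be distributed through the Hugging Face Hub under the JetBrains-Research organization. Below is a minimal sketch of pulling one of them with the `datasets` library; the dataset id and split name are assumptions on my part and should be checked against the actual Long Code Arena listing.

```python
# Minimal sketch: loading one Long Code Arena dataset from the Hugging Face Hub.
# NOTE: the dataset id and split below are assumptions -- verify the real ids
# on the JetBrains-Research organization page before relying on this.
from datasets import load_dataset

# Hypothetical id for the library-based code generation task.
DATASET_ID = "JetBrains-Research/lca-library-based-code-generation"

ds = load_dataset(DATASET_ID, split="test")

# Inspect one example to see what fields are available and how large the
# repository-level contexts actually are.
example = ds[0]
print(example.keys())
```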
It is particularly relevant to our case because:
@andyjda