open-compass / opencompass

OpenCompass is an LLM evaluation platform that supports a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0
4.22k stars, 451 forks

[Feature] Support BigCodeBench #1460

Open: terryyz opened this issue 3 months ago

terryyz commented 3 months ago

Describe the feature

BigCode (Hugging Face and ServiceNow Research) has released BigCodeBench, a new large-scale benchmark for code generation with diverse function calls and complex instructions. It covers 1,140 expert-annotated tasks and has been officially used by DeepSeek and CodeGeeX4. BigCodeBench is considered a better alternative to HumanEval and other function-level code generation benchmarks (see here).

Will you implement it?

tonysy commented 3 months ago

Thanks for the suggestion! Would you like to contribute this benchmark to OpenCompass?

terryyz commented 3 months ago

Thanks for asking! I'm quite busy for the next two months, but I might be able to take a look in October if no one else has the bandwidth to add the benchmark.