tjunlp-lab / Awesome-LLMs-Evaluation-Papers

The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.

Why do we list an inaccessible benchmark? #14

Closed: zhimin-z closed this issue 6 months ago

zhimin-z commented 8 months ago

Where are the results of these two benchmarks for domain evaluation in the OpenEval leaderboard?

allen3ai commented 7 months ago

Thanks for your attention. The site was previously inaccessible because the domain was undergoing filing; it is now accessible.

allen3ai commented 7 months ago

In addition, OpenEval is not a benchmark but an evaluation platform for Chinese large language models, and it includes several Chinese benchmarks.

allen3ai commented 7 months ago

Hi, thanks for your questions. Regarding the GitHub issue tracker: it has not been set up yet because OpenEval is still being fixed, and we are discussing this, so please wait. As for the two domain benchmarks: "罪名法务智能数据集" (a criminal-charge legal intelligence dataset) is being tested now, but we have found a better law benchmark for LLMs, so we may replace it. "WGLaw" has already been tested, and we will release its results shortly; this benchmark was collected from the Yellow River Conservancy Commission of the Ministry of Water Resources. However, the overall results are still poor, and we will add the new results for these domain benchmarks soon.