tjunlp-lab / Awesome-LLMs-Evaluation-Papers

The papers are organized according to our survey: Evaluating Large Language Models: A Comprehensive Survey.

Update README.md #16

Closed by john-b-yang 10 months ago

john-b-yang commented 10 months ago

Created PR that addresses issue #11.

Also, as one of the WebShop authors, I think "Evaluating LLMs as Agents" is a much more suitable section for it than tool evaluation. WebShop is the first instantiation of a web-based setting for language agents and has served as inspiration for later work such as AgentBench and WebArena (one of AgentBench's environments is an adaptation of WebShop).

Would you also mind adding the links for InterCode and SWE-bench to your leaderboard section when #13 is merged? Thanks.

john-b-yang commented 10 months ago

@zhimin-z I see, although to me it feels like some of the leaderboards in the updated README are fairly limited to a single task (i.e. HumanEval)?

Regarding being added to the leaderboard list, I'll leave it up to the authors. But just wanted to point it out for consideration!

zhimin-z commented 10 months ago

> @zhimin-z I see, although to me it feels like some of the leaderboards in the updated README are fairly limited to a single task (i.e. HumanEval)?
>
> Regarding being added to the leaderboard list, I'll leave it up to the authors. But just wanted to point it out for consideration!

Good catch. I double-checked with the HF maintainers and confirmed that the HF HumanEval leaderboard is deprecated, so I removed it in https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers/pull/13.

zhimin-z commented 10 months ago

After further examination, I've decided to add https://intercode-benchmark.github.io/ and https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard in https://github.com/tjunlp-lab/Awesome-LLMs-Evaluation-Papers/pull/13, @john-b-yang.
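For reference, here is a minimal sketch of how those two entries might appear in the README's leaderboard section. The list format and link titles below are assumptions on my part; the actual wording and placement in #13 are up to the maintainers.

```markdown
<!-- Hypothetical entries for the leaderboard section; the actual text in PR #13 may differ -->
- [InterCode Benchmark](https://intercode-benchmark.github.io/)
- [Big Code Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard)
```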

john-b-yang commented 10 months ago

Sweet, thanks so much @zhimin-z! Really appreciate the dialogue.