Hi TJUNLP team, thanks so much for the great work! I think a paper that presents a holistic view of current NLP benchmarks is relevant amidst the many ongoing efforts.
To this end, I'd like to point out a couple of works on evaluating language models on coding-related tasks, such as code completion, patch generation, and language agents that use code as actions.
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation; paper, site
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?; paper, site
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback; paper, site
Thanks in advance!