Hi TJUNLP team, thanks so much for the great work! I think a paper that presents a holistic view of current NLP benchmarks is relevant amidst the many ongoing efforts.
To this end, I'd like to point out a couple of works on evaluating language models on coding-related tasks, such as code completion, patch generation, and language agents that use code as actions.
DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation; paper, site
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?; paper, site
InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback; paper, site
Thanks in advance!