symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of LLM code generation.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Evaluation task: Code repair #168

Closed by ruiAzevedo19 3 months ago

ruiAzevedo19 commented 3 months ago

Goal

Given source code with compilation errors, the model must repair the code so that it compiles. The response is validated by executing predefined tests, ensuring that the implementation itself was not altered.
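For illustration, here is a minimal sketch of what a task instance could look like in Go (the file names, the `HasSuffix` function, and the tests are hypothetical examples, not taken from the benchmark's actual task set):

```go
// task.go: the source handed to the model. It fails to compile with
// "cannot use 0 (untyped int constant) as bool value in return statement".
package task

// HasSuffix reports whether s ends in suffix.
func HasSuffix(s string, suffix string) bool {
	if len(s) < len(suffix) {
		return 0 // Compilation error: the expected repair is `return false`.
	}
	return s[len(s)-len(suffix):] == suffix
}
```

The repaired file would then be checked against predefined tests like the following, which pass only if the compilation error is fixed and the original behavior is preserved:

```go
// task_test.go: predefined tests validating the repair.
package task

import "testing"

func TestHasSuffix(t *testing.T) {
	if !HasSuffix("main.go", ".go") {
		t.Error(`expected "main.go" to have suffix ".go"`)
	}
	if HasSuffix(".go", "main.go") {
		t.Error(`did not expect ".go" to have suffix "main.go"`)
	}
}
```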

PRs

Follow-up