symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

Apply symflower fix to a "write-test" result of a model #213

Open bauersimon opened 4 days ago

bauersimon commented 4 days ago

Basically we want to execute the "write-test" task, but then optionally call symflower repair on the generated tests. Plus, the scoring should treat both the original "write-test" and the (hopefully) repaired tests as different results.
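A rough sketch of what that flow could look like — all names here (the result key `write-test-symflower-fix`, the test path, and the exact `symflower` invocation) are assumptions for illustration, not the repository's actual API:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runSymflowerFix invokes the symflower CLI on the generated test file. The exact
// subcommand and arguments are assumptions; the real invocation has to be checked
// against the symflower CLI.
func runSymflowerFix(testFilePath string) error {
	cmd := exec.Command("symflower", "fix", testFilePath)
	output, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("symflower fix failed: %w\n%s", err, output)
	}
	return nil
}

func main() {
	// Hypothetical path to the tests a model produced for the "write-test" task.
	generatedTest := "evaluation/plain/plain_test.go"

	// Record the raw model output as the plain "write-test" result.
	results := map[string]string{"write-test": generatedTest}

	// Optionally repair the generated tests and record that as a second, separate
	// result so both variants are scored independently.
	if err := runSymflowerFix(generatedTest); err != nil {
		fmt.Println("repair step failed, keeping only the raw result:", err)
	} else {
		results["write-test-symflower-fix"] = generatedTest
	}

	fmt.Println(results)
}
```

The important part is only that the repaired tests end up under their own result key instead of overwriting the original "write-test" result.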

bauersimon commented 3 days ago

@Munsio plz review

bauersimon commented 3 days ago

The task evaluation logic can carry over the "broken code" and just ask the next model to do a "repair". The models themselves don't need to worry about sharing context or what they need to do. The evaluation logic will say: "this failed, please fix it, this is what we have so far".
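As a sketch of that idea, the evaluation logic could build the repair query itself from the carried-over context — function name, prompt wording, and the example inputs below are all hypothetical:

```go
package main

import "fmt"

// repairPrompt assembles the follow-up query for a repair attempt. Wording and
// parameters are illustrative only; the point is that the evaluation logic owns
// the carried-over context, and the model just gets "this failed, fix it".
func repairPrompt(language, sourceFile, brokenTestCode, failureOutput string) string {
	return fmt.Sprintf(
		"The tests written for %q do not compile or do not pass.\n\nCurrent %s test code:\n%s\n\nError output:\n%s\n\nPlease fix the tests.",
		sourceFile, language, brokenTestCode, failureOutput,
	)
}

func main() {
	// Hypothetical "broken code" and error output carried over from the failed
	// "write-test" attempt.
	broken := "func TestPlain(t *testing.T) { plain() }" // missing package clause and imports
	failure := "expected 'package', found 'func'"

	fmt.Println(repairPrompt("Go", "plain.go", broken, failure))
}
```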