symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Correct the tests of the "variable unknown" mistakes case #212

Closed bauersimon closed 3 days ago

bauersimon commented 4 days ago

@zimmski

With this PR, we get the following results when test-driving the new "code-repair" task:

model,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/anthropic/claude-3.5-sonnet,golang,golang/mistakes,code-repair,150,130,5,648,10236,718,5,5,5
openrouter/anthropic/claude-3.5-sonnet,golang,golang/plain,write-tests,14,10,1,116,2195,126,1,1,1
openrouter/anthropic/claude-3.5-sonnet,java,java/mistakes,code-repair,394,380,5,997,19647,1021,5,2,2
openrouter/anthropic/claude-3.5-sonnet,java,java/plain,write-tests,14,10,1,213,2278,225,1,1,1
openrouter/deepseek/deepseek-coder,golang,golang/mistakes,code-repair,150,130,5,638,17993,708,5,5,5
openrouter/deepseek/deepseek-coder,golang,golang/plain,write-tests,14,10,1,74,2725,88,1,1,1
openrouter/deepseek/deepseek-coder,java,java/mistakes,code-repair,400,380,5,943,24188,1003,5,5,5
openrouter/deepseek/deepseek-coder,java,java/plain,write-tests,14,10,1,137,3631,149,1,1,1
openrouter/google/gemini-flash-1.5,golang,golang/mistakes,code-repair,150,130,5,638,7783,708,5,5,5
openrouter/google/gemini-flash-1.5,golang,golang/plain,write-tests,14,10,1,74,1534,88,1,1,1
openrouter/google/gemini-flash-1.5,java,java/mistakes,code-repair,400,380,5,943,9351,1003,5,5,5
openrouter/google/gemini-flash-1.5,java,java/plain,write-tests,14,10,1,184,1685,196,1,1,1
openrouter/openai/gpt-4o,golang,golang/mistakes,code-repair,150,130,5,638,12254,708,5,5,5
openrouter/openai/gpt-4o,golang,golang/plain,write-tests,14,10,1,74,1912,88,1,1,1
openrouter/openai/gpt-4o,java,java/mistakes,code-repair,400,380,5,943,20604,1003,5,5,5
openrouter/openai/gpt-4o,java,java/plain,write-tests,14,10,1,137,2998,149,1,1,1
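
To spot rows like the Sonnet `java/mistakes` one quickly, a small script can flag every row where a response metric is lower than `files-executed`. This is only a sketch: it assumes (from the rows above, not from the scoring code) that `response-no-error`, `response-no-excess` and `response-with-code` should each equal `files-executed` when every response is clean. The helper name `rows_missing_points` is made up for illustration.

```python
import csv
import io

# The java/mistakes code-repair rows from the results above.
RESULTS = """\
model,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/anthropic/claude-3.5-sonnet,java,java/mistakes,code-repair,394,380,5,997,19647,1021,5,2,2
openrouter/deepseek/deepseek-coder,java,java/mistakes,code-repair,400,380,5,943,24188,1003,5,5,5
openrouter/google/gemini-flash-1.5,java,java/mistakes,code-repair,400,380,5,943,9351,1003,5,5,5
openrouter/openai/gpt-4o,java,java/mistakes,code-repair,400,380,5,943,20604,1003,5,5,5
"""

# Assumption: each of these should equal files-executed for a clean run.
RESPONSE_METRICS = ("response-no-error", "response-no-excess", "response-with-code")


def rows_missing_points(csv_text):
    """Return (model, repository, shortfall) for rows with a response metric below files-executed."""
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["model"] == "model":
            # Skip repeated header lines in case several CSV outputs were concatenated.
            continue
        files = int(row["files-executed"])
        shortfall = {
            name: files - int(row[name])
            for name in RESPONSE_METRICS
            if int(row[name]) < files
        }
        if shortfall:
            flagged.append((row["model"], row["repository"], shortfall))
    return flagged


flagged = rows_missing_points(RESULTS)
for model, repository, shortfall in flagged:
    print(model, repository, shortfall)
```

Only the Sonnet row is flagged, with a shortfall of 3 on `response-no-excess` and 3 on `response-with-code`. That adds up to the 6 points between 394 and 400, which seems consistent with these metrics feeding directly into the score.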

Sonnet scores lower on the Java "code-repair" task (394 vs. 400) because it does not always wrap its response in code fences, which leads to https://github.com/symflower/eval-dev-quality/issues/43
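
For illustration, here is a rough sketch of how a response with and without code fences could be told apart. This is not the parser used in this repository: `extract_code` and the regular expression are hypothetical, and the fallback of treating an unfenced response as code is just one possible choice.

```python
import re

# Matches the first fenced block: opening fence with optional language tag,
# then everything up to the next closing fence.
FENCE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)


def extract_code(response):
    """Return (code, had_fence) for a model response.

    Hypothetical helper: if no fence is found, the whole response is
    returned as code and had_fence is False.
    """
    match = FENCE.search(response)
    if match:
        return match.group(1).strip(), True
    return response.strip(), False


fenced = "Here is the fix:\n```java\nclass A {}\n```\nDone."
unfenced = "class A {}"

print(extract_code(fenced))
print(extract_code(unfenced))
```

With a check like this, an unfenced response can still be evaluated, while `had_fence` records that the model did not follow the expected output format.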