Closed bauersimon closed 3 days ago
@zimmski
with this PR we have the following results for test-driving the new "code-repair" task:
model,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/anthropic/claude-3.5-sonnet,golang,golang/mistakes,code-repair,150,130,5,648,10236,718,5,5,5
openrouter/anthropic/claude-3.5-sonnet,golang,golang/plain,write-tests,14,10,1,116,2195,126,1,1,1
openrouter/anthropic/claude-3.5-sonnet,java,java/mistakes,code-repair,394,380,5,997,19647,1021,5,2,2
openrouter/anthropic/claude-3.5-sonnet,java,java/plain,write-tests,14,10,1,213,2278,225,1,1,1
openrouter/deepseek/deepseek-coder,golang,golang/mistakes,code-repair,150,130,5,638,17993,708,5,5,5
openrouter/deepseek/deepseek-coder,golang,golang/plain,write-tests,14,10,1,74,2725,88,1,1,1
openrouter/deepseek/deepseek-coder,java,java/mistakes,code-repair,400,380,5,943,24188,1003,5,5,5
openrouter/deepseek/deepseek-coder,java,java/plain,write-tests,14,10,1,137,3631,149,1,1,1
openrouter/google/gemini-flash-1.5,golang,golang/mistakes,code-repair,150,130,5,638,7783,708,5,5,5
openrouter/google/gemini-flash-1.5,golang,golang/plain,write-tests,14,10,1,74,1534,88,1,1,1
openrouter/google/gemini-flash-1.5,java,java/mistakes,code-repair,400,380,5,943,9351,1003,5,5,5
openrouter/google/gemini-flash-1.5,java,java/plain,write-tests,14,10,1,184,1685,196,1,1,1
openrouter/openai/gpt-4o,golang,golang/mistakes,code-repair,150,130,5,638,12254,708,5,5,5
openrouter/openai/gpt-4o,golang,golang/plain,write-tests,14,10,1,74,1912,88,1,1,1
openrouter/openai/gpt-4o,java,java/mistakes,code-repair,400,380,5,943,20604,1003,5,5,5
openrouter/openai/gpt-4o,java,java/plain,write-tests,14,10,1,137,2998,149,1,1,1
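Sonnet scores lower on the Java "code-repair" task (394 vs. 400 for the other models) because it does not always wrap its response in code fences: only 2 of its 5 responses count for response-no-excess and response-with-code. This is the problem tracked in https://github.com/symflower/eval-dev-quality/issues/43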