symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Extract model costs into log and CSVs, so the pricing information is always available #216

Closed: ruiAzevedo19 closed this 2 days ago

ruiAzevedo19 commented 3 days ago

Part of #210

bauersimon commented 3 days ago

Please try this out with the cheapest model from openrouter and post the CSV here so we can see what it looks like.

ruiAzevedo19 commented 2 days ago

@bauersimon These are the results:

evaluation.csv
```
model,cost,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,golang,golang/plain,write-tests,1,0,0,87,1186,90,1,0,0
```

evaluation.log
```
(...)
2024/06/26 10:35:47 Evaluation score for "openrouter/meta-llama/llama-3-8b-instruct" ("response-no-code"): cost=0.00, score=1, coverage=0, files-executed=0, generate-tests-for-file-character-count=87, processing-time=1186, response-character-count=90, response-no-error=1, response-no-excess=0, response-with-code=0
```

golang-summed.csv
```
model,cost,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,1,0,0,87,1186,90,1,0,0
```

models-summed.csv
```
model,cost,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,1,0,0,87,1186,90,1,0,0
```

bauersimon commented 2 days ago

Awesome. The cost in the log is kinda useless here since it shows up as 0.00, but it should be higher for more expensive models anyway. We just need to remember to scale the costs up for our evaluations then.
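
For illustration, here is a rough sketch of why the log shows `cost=0.00` while the CSVs show `0.00000014`, and what scaling could look like. This is not the project's actual code: the helper names `formatCostForLog`, `formatCostForCSV`, and `scaleCost`, the two-decimal log format, and the scaling factor of one million are all assumptions made for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatCostForLog is a hypothetical helper that assumes the log line
// prints the cost with two decimals, which would explain "cost=0.00".
func formatCostForLog(cost float64) string {
	return fmt.Sprintf("%.2f", cost)
}

// formatCostForCSV is a hypothetical helper that writes the shortest
// decimal representation that round-trips, so tiny costs stay visible.
func formatCostForCSV(cost float64) string {
	return strconv.FormatFloat(cost, 'f', -1, 64)
}

// scaleCost is a hypothetical helper that multiplies a cost by a factor
// so that very small values become readable. The factor is an assumption.
func scaleCost(cost float64, factor float64) float64 {
	return cost * factor
}

func main() {
	cost := 0.00000014 // Value taken from the CSV output posted above.

	fmt.Println(formatCostForLog(cost))                 // 0.00
	fmt.Println(formatCostForCSV(cost))                 // 0.00000014
	fmt.Println(formatCostForLog(scaleCost(cost, 1e6))) // 0.14
}
```

With a factor like this, even the cheapest model would produce a non-zero value in a two-decimal log line, though the right factor depends on how the evaluation reports costs.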