Extract model costs into log and CSVs, so the pricing information is always available

ruiAzevedo19 commented 3 days ago

Part of #210

bauersimon commented 3 days ago

Please try this out with the cheapest model from OpenRouter and post the CSV here so we can see what it looks like.

ruiAzevedo19 commented 2 days ago

@bauersimon These are the results:

Command: `eval-dev-quality evaluate --runs 1 --repository golang/plain --model openrouter/meta-llama/llama-3-8b-instruct`
Model info: https://openrouter.ai/models/meta-llama/llama-3-8b-instruct
Current price:
- Prompt: $0.07/M input tokens = $0.00000007/input token
- Completion: $0.07/M output tokens = $0.00000007/output token
- Total cost should be: $0.00000014 (prompt price plus completion price, per token)

evaluation.csv

```
model,cost,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,golang,golang/plain,write-tests,1,0,0,87,1186,90,1,0,0
```

evaluation.log

```
(...)
2024/06/26 10:35:47 Evaluation score for "openrouter/meta-llama/llama-3-8b-instruct" ("response-no-code"): cost=0.00, score=1, coverage=0, files-executed=0, generate-tests-for-file-character-count=87, processing-time=1186, response-character-count=90, response-no-error=1, response-no-excess=0, response-with-code=0
```
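
The log shows `cost=0.00` while the CSVs show `0.00000014`. That would be consistent with the log printing the value with two decimal places, although this is an assumption and not confirmed by the code in this thread. A small Go sketch of the difference, using the hypothetical helpers `formatTwoDecimals` and `formatFullPrecision`:

```go
package main

import (
	"fmt"
	"strconv"
)

// formatTwoDecimals prints a cost with two decimal places, which
// rounds very small per-token prices down to "0.00".
func formatTwoDecimals(cost float64) string {
	return fmt.Sprintf("%.2f", cost)
}

// formatFullPrecision prints a cost with the minimum number of
// digits needed to represent the value without loss.
func formatFullPrecision(cost float64) string {
	return strconv.FormatFloat(cost, 'f', -1, 64)
}

func main() {
	cost := 0.00000014

	fmt.Println("two decimals:  ", formatTwoDecimals(cost))
	fmt.Println("full precision:", formatFullPrecision(cost))
}
```

For the value in this thread, the first helper yields `0.00` and the second yields `0.00000014`.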

golang-summed.csv

```
model,cost,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,1,0,0,87,1186,90,1,0,0
```

models-summed.csv

```
model,cost,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
openrouter/meta-llama/llama-3-8b-instruct,0.00000014,1,0,0,87,1186,90,1,0,0
```

bauersimon commented 2 days ago

Awesome. The cost in the log is kinda useless since it shows up as `cost=0.00`, but it should be higher for more expensive models anyway. We just need to remember to scale the costs up for our evaluations then.

symflower / eval-dev-quality

Extract model costs into log and CSVs, so the pricing information is always available #216