[ ] For non-selective evaluations, exclude certain models, e.g. "openrouter/auto" needs to go because it is not a real model: it just forwards requests to another model automatically https://github.com/symflower/eval-dev-quality/issues/126
[ ] Think about excluding the "Perplexity" models because they have a "per request" cost, and they are the only ones that do that.
[ ] Snowflake against Databricks would be a nice comparison since they align company-wise and are new.
[ ] Include more models (Our main problem is that multiple models are coming out every day. We should not wait for a "new version" of the eval; we should test these models right away and compare them. Big problem: how do we promote findings?)
[ ] Looking through logs: Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go tasks is automatically ranked higher than the opposite.
[ ] Write out results right away so we don't lose anything if the evaluation crashes.
[ ] Remove absolute paths completely, e.g. in stack traces too.
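A minimal sketch of what that could look like, assuming one JSON object per line in an append-only file (the `result` type and file name are hypothetical):

```go
package main

import (
	"encoding/json"
	"os"
)

// result is a hypothetical record type standing in for an evaluation result.
type result struct {
	Model string `json:"model"`
	Score uint   `json:"score"`
}

// writeResult appends one result and flushes it to disk immediately, so a
// later crash cannot lose it.
func writeResult(file *os.File, r result) error {
	if err := json.NewEncoder(file).Encode(r); err != nil {
		return err
	}

	return file.Sync()
}

func main() {
	file, err := os.OpenFile("results.jsonl", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer file.Close()

	if err := writeResult(file, result{Model: "some-model", Score: 42}); err != nil {
		panic(err)
	}
}
```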
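One possible approach, sketched under the assumption that all absolute paths lie below the current working directory (the function name and placeholder are hypothetical):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// rewriteAbsolutePaths replaces the working directory prefix in arbitrary
// output, e.g. stack traces, with a stable placeholder so logs and results
// are comparable across machines.
func rewriteAbsolutePaths(output string) string {
	workingDirectory, err := os.Getwd()
	if err != nil {
		return output
	}

	return strings.ReplaceAll(output, workingDirectory, "$REPOSITORY")
}

func main() {
	workingDirectory, _ := os.Getwd()
	fmt.Println(rewriteAbsolutePaths("panic at " + workingDirectory + "/plain/plain.go:5"))
}
```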
[ ] Automatically interpret "Extra code" #44
[ ] Figure out the "perfect" coverage score so we can display the percentage of coverage reached
[ ] Make coverage metric fair
[ ] "Looking through logs... Java consistently has more code than Go for the same tasks, which yields more coverage. So a model that solves all Java tasks but no Go is automatically higher ranked than the opposite." -> only Symflower coverage will make this fair
[ ] Save the descriptons of the models as well: https://openrouter.ai/api/v1/models The reason is that these can change over time, and we need to know after a while what they where. e.g right now i would like to know if mistral-7b-instruct for the last evaluation was v0.1. or not
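A sketch of fetching and archiving the model list alongside the evaluation results; the endpoint is the one linked above, the output file name is just an example:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Fetch the current model list including descriptions from OpenRouter.
	response, err := http.Get("https://openrouter.ai/api/v1/models")
	if err != nil {
		panic(err)
	}
	defer response.Body.Close()

	data, err := io.ReadAll(response.Body)
	if err != nil {
		panic(err)
	}

	// Store the raw JSON with a date stamp next to the evaluation results.
	fileName := fmt.Sprintf("openrouter-models-%s.json", time.Now().Format("2006-01-02"))
	if err := os.WriteFile(fileName, data, 0644); err != nil {
		panic(err)
	}
}
```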
[ ] Bar charts should have have their value on the bar. The axis values do not work that well
[ ] Pick an example or several examples per category: goal is to find interesting results automatically, because it will get harder and harder to go manually through results.
[ ] Charts to showcase data
[ ] Total-scores vs. costs scatter plot. The upper-left corner is the sweet spot: cheap and good results.
[ ] Pie chart of the whole evaluation's costs: for each LLM, show how much it costs. The result is to see which LLMs cost the most to run the eval.
[ ] Reporting and documentation on writing deep-dives
[ ] What are results that align with expectations? What are results against expectations? E.g. are there small LLMs that are better than big ones?
[ ] Are there big LLMs that totally fail?
[ ] Are there small LLMs that are surprisingly good?
[ ] What about LLMs where the community doesn't know that much yet, e.g. Snowflake, DBRX, ...?
[ ] Order models by open-weight, allows-commercial-use, closed, plus price(!) and size: e.g. https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1 is great because it is open-weight and Apache 2.0, so commercial use is allowed. It should be rated better than GPT-4.
[ ] Distinguish between latency (time to first token) and throughput (tokens generated per second)
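A minimal sketch of measuring both metrics from a token stream; the channel is a hypothetical stand-in for wrapping the provider's streaming API:

```go
package main

import (
	"fmt"
	"time"
)

// measure returns latency (time to first token) and throughput (tokens
// generated per second) for one streamed completion.
func measure(stream <-chan string) (timeToFirstToken time.Duration, tokensPerSecond float64) {
	start := time.Now()

	tokenCount := 0
	for range stream {
		if tokenCount == 0 {
			timeToFirstToken = time.Since(start)
		}
		tokenCount++
	}

	if elapsed := time.Since(start).Seconds(); elapsed > 0 {
		tokensPerSecond = float64(tokenCount) / elapsed
	}

	return timeToFirstToken, tokensPerSecond
}

func main() {
	stream := make(chan string, 3)
	for _, token := range []string{"fmt", ".", "Println"} {
		stream <- token
	}
	close(stream)

	latency, throughput := measure(stream)
	fmt.Println(latency, throughput)
}
```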
[ ] Documentation
[ ] Clean up and extend README
[ ] Better examples for contributions
[ ] Overhaul the explanation of "why" we need evaluation, i.e. why it is good to evaluate even an empty function that does nothing.
[ ] Write down a playbook for evaluations, e.g. one thing that should happen is that we let the benchmark run 5 times and then sum up the points, but the runs should have at least a one-hour break in between to not run into cached responses.
[ ] Write Tutorial for using Ollama
[ ] YouTube video for using Ollama
[ ] Tooling & Installation
[ ] Rescore existing models / evals with fixes, e.g. when we ship a better code repair tool, the LLM answers did not change, so we should rescore the whole result of an eval right away with the new version of the tool.
[ ] Automatic tool installation with fixed version
[ ] Go
[ ] Java
[ ] Ensure that non-critical CLI input validation (such as unavailable models) does not panic
[ ] Automatically updated leaderboard for this repository: #26
[ ] Take a look at current leaderboards and evals to know what could be interesting. Current popular code leaderboards are [LiveCodeBench](https://huggingface.co/spaces/livecodebench/leaderboard), the [BigCode models leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard), [CyberSecEval](https://huggingface.co/spaces/facebook/CyberSecEval) and [CanAICode](https://huggingface.co/spaces/mike-ravkine/can-ai-code-results)
[ ] Add an app-name to the requests so people know we are the eval. https://openrouter.ai/docs#quick-start shows that other OpenAI API packages implement custom headers, but the Go package we are using does not implement that. So do a PR to contribute.
[ ] Add an evaluation task for "querying the relative test file path of a relative implementation file path", e.g. "What is the relative test file path for some/implementation/file.go?" ... it is "some/implementation/file_test.go" in most cases.
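Until such a PR lands, a workaround sketch is to wrap the HTTP client with a transport that injects the attribution headers OpenRouter documents (`HTTP-Referer` and `X-Title`); the header values used here are assumptions:

```go
package main

import "net/http"

// headerTransport adds app-identifying headers to every outgoing request.
type headerTransport struct {
	base http.RoundTripper
}

func (t headerTransport) RoundTrip(request *http.Request) (*http.Response, error) {
	// Clone the request, since a RoundTripper must not modify the original.
	request = request.Clone(request.Context())
	request.Header.Set("HTTP-Referer", "https://github.com/symflower/eval-dev-quality")
	request.Header.Set("X-Title", "eval-dev-quality")

	return t.base.RoundTrip(request)
}

func main() {
	// Pass this client to the OpenRouter/OpenAI Go package in use.
	client := &http.Client{
		Transport: headerTransport{base: http.DefaultTransport},
	}
	_ = client
}
```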
[ ] Add evaluation task for transpilation Go->Java and Java->Go
[ ] Add evaluation task for code refactoring: two functions with the same code -> extract into a helper function
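A possible fixture for this task, sketched as a hypothetical example; the two functions share an identical body and the expected solution extracts it:

```go
package task

// Both functions duplicate the same body on purpose.
func userHeader(name string) string {
	return "### " + name + " ###"
}

func groupHeader(name string) string {
	return "### " + name + " ###"
}

// Expected solution: both call a single extracted helper instead, e.g.
//
//	func header(name string) string {
//		return "### " + name + " ###"
//	}
```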
[ ] Add evaluation task for implementing and fixing bugs using TDD
[ ] Scoring, Categorization, Bar Charts split by language.
[ ] Check the determinism of models, e.g. execute each plain repository X times, and then check if the results are stable.
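A sketch of such a stability check, comparing hashes of the raw responses over X runs (`runEvaluation` is a hypothetical stand-in for evaluating one model on one plain repository):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// isDeterministic runs an evaluation several times and reports whether every
// run produced a byte-identical response.
func isDeterministic(runEvaluation func() string, runs int) bool {
	var first [32]byte
	for i := 0; i < runs; i++ {
		digest := sha256.Sum256([]byte(runEvaluation()))
		if i == 0 {
			first = digest
		} else if digest != first {
			return false
		}
	}

	return true
}

func main() {
	fmt.Println(isDeterministic(func() string { return "same response" }, 5))
}
```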
[ ] Code repair
[ ] Own task category
[ ] 0-shot, 1-shot, ...
[ ] With LLM repair
[ ] With tool repair
[ ] Do test file paths through
    [ ] `symflower symbols`
    [ ] Task for models
[ ] Query REAL costs of all the testing of a model: the reason this is interesting is that some models have HUGE outputs, and since more output means more cost, this should be addressed in the score.
[ ] Move towards generated cases so models cannot integrate fixed cases into their training to always reach a 100% score.
[ ] Think about adding more training data generation features: this will also help with dynamic cases.
[ ] Heard that Snowflake Arctic is very open about how they gathered training data... so we can see what LLM creators think about and want from training data.
[ ] Think about a commercial effort for the eval, so we can balance some of the costs that go into maintaining this eval.
[ ] Benchmark that showcases base models vs. their fine-tuned coding models, e.g. in v0.5.0 we see that Codestral, CodeLlama, ... are worse.
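A sketch of querying the actual cost of a single generation. This assumes OpenRouter's generation stats endpoint and its `total_cost` field; verify both against the current API documentation before relying on them:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// generationCost asks OpenRouter for the real cost of one generation.
func generationCost(generationID string) (float64, error) {
	request, err := http.NewRequest("GET", "https://openrouter.ai/api/v1/generation?id="+generationID, nil)
	if err != nil {
		return 0, err
	}
	request.Header.Set("Authorization", "Bearer "+os.Getenv("OPENROUTER_API_KEY"))

	response, err := http.DefaultClient.Do(request)
	if err != nil {
		return 0, err
	}
	defer response.Body.Close()

	var payload struct {
		Data struct {
			TotalCost float64 `json:"total_cost"`
		} `json:"data"`
	}
	if err := json.NewDecoder(response.Body).Decode(&payload); err != nil {
		return 0, err
	}

	return payload.Data.TotalCost, nil
}

func main() {
	cost, err := generationCost("some-generation-id")
	if err != nil {
		panic(err)
	}
	fmt.Printf("Total cost in USD: %f\n", cost)
}
```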
TODO sort and sort out
[ ] `Model` and `Provider` to be in the same package https://github.com/symflower/eval-dev-quality/pull/121#discussion_r1603371915
[ ] No test files
    [ ] Actually identify and error that there are no test files (needs to be implemented in `symflower test`)
    [ ] Use `symflower symbols` to receive files