symflower/eval-dev-quality
DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License
57 stars, 2 forks
issues
#221 Parallel execution of containerized evaluations (Munsio, opened 2 hours ago, 0 comments)
#220 Multiple model parameter with same value result in multiple evaluations (Munsio, opened 2 hours ago, 0 comments)
#219 unable to create temporary repository path: exec: WaitDelay expired before I/O complete (bauersimon, opened 9 hours ago, 2 comments)
#218 Multiple Runs again (Munsio, opened 20 hours ago, 0 comments)
#217 Extract model names, to obtain a human-readable name for each model (ruiAzevedo19, closed 4 hours ago, 6 comments)
#216 Extract model costs into log and CSVs, so the pricing information is always available (ruiAzevedo19, closed 6 hours ago, 3 comments)
#215 Report the maximum theoretically reachable #files-executed (bauersimon, opened 1 day ago, 0 comments)
#214 v0.5.0 Evaluation results (bauersimon, closed 1 day ago, 0 comments)
#213 Chain "write-test" and "code-repair" task with combining different models (bauersimon, opened 1 day ago, 3 comments)
#212 Correct the tests of the "variable unknown" mistakes case (bauersimon, closed 7 hours ago, 1 comment)
#211 Docker runtime (Munsio, opened 2 days ago, 1 comment)
#210 Extract model costs into log and CSVs (bauersimon, opened 2 days ago, 0 comments)
#209 Do an evaluation run for all "good open weight models" with all available quantizations and different GPUs (zimmski, opened 3 days ago, 0 comments)
#208 Interactive result comparison (bauersimon, opened 5 days ago, 0 comments)
#207 Add the current commit revision to the binary, Docker image and reports (zimmski, opened 5 days ago, 0 comments)
#206 Extract human-readable names for models (bauersimon, closed 4 hours ago, 0 comments)
#205 Tool/command to combine multiple evaluations into one (bauersimon, opened 6 days ago, 1 comment)
#204 Keep individual coverage files and LLM query/responses (zimmski, opened 1 week ago, 1 comment)
#203 refactor, Move fields of "TaskWriteTests" and "TaskCodeRepair" into context object since the tasks itself are stateless (ahumenberger, opened 1 week ago, 0 comments)
#202 Share logging setup between tasks (bauersimon, opened 1 week ago, 0 comments)
#201 Evaluation task: Transpile (ruiAzevedo19, opened 1 week ago, 2 comments)
#200 Follow-up "Code repairing task to enable models to fix code with compilation errors" (ruiAzevedo19, opened 1 week ago, 0 comments)
#199 Build a docker image to run the evaluation in an isolated environment (Munsio, closed 2 days ago, 0 comments)
#198 Isolation of evaluations (Munsio, opened 1 week ago, 0 comments)
#197 Task interface to accommodate different types of tasks (ahumenberger, closed 1 week ago, 0 comments)
#196 Roadmap issue template and README section about releases (bauersimon, opened 1 week ago, 1 comment)
#195 Roadmap for v0.6.0 (zimmski, opened 1 week ago, 0 comments)
#194 Evaluation task: TDD (ahumenberger, opened 1 week ago, 0 comments)
#193 Coverage for Java is tracked for lines, while Go is tracked for ranges (bauersimon, opened 1 week ago, 4 comments)
#192 Early merger for code repair task (ruiAzevedo19, closed 1 week ago, 0 comments)
#191 fix, Retry querying Openrouter models cause that sometimes fails (bauersimon, closed 1 week ago, 0 comments)
#190 fix, Missing CSV header for task (bauersimon, closed 1 week ago, 1 comment)
#189 Script for sequentially evaluating common models with "light" repository (bauersimon, closed 1 week ago, 0 comments)
#188 fix, Define a timeout for the "symflower test" command, so ensure the execution does not take too much time (ruiAzevedo19, closed 1 week ago, 0 comments)
#187 CSV report header is missing the task identifier (bauersimon, closed 1 week ago, 0 comments)
#186 Openrouter returns 524 when querying models (bauersimon, closed 1 week ago, 0 comments)
#185 Add timeout to `symflower test` (bauersimon, closed 1 week ago, 0 comments)
#184 Explicit windows test case for Java test path logic (bauersimon, closed 1 week ago, 1 comment)
#183 Extend temporary repository tests to Windows (bauersimon, closed 1 week ago, 0 comments)
#182 fix, Default to all repositories if none are selected in CLI (bauersimon, closed 1 week ago, 0 comments)
#181 Log model responses directly to file and reuse them for debugging (bauersimon, opened 2 weeks ago, 0 comments)
#180 fix, Create temporary repositories once (bauersimon, closed 2 weeks ago, 1 comment)
#179 fix, Rename result directory if it already exists (bauersimon, closed 2 weeks ago, 1 comment)
#178 Improve maintainability of assessments by abstracting away details of how assessments are stored (ahumenberger, closed 1 week ago, 0 comments)
#177 fix, Allow arbitrary content immediately after code tag (bauersimon, closed 2 weeks ago, 0 comments)
#176 If results folder already exists, add suffix but don't overwrite or error (bauersimon, closed 2 weeks ago, 0 comments)
#175 Collect Go coverage if tests trigger panic (bauersimon, closed 2 weeks ago, 1 comment)
#174 Deal with dependencies requested by LLMs (ahumenberger, opened 2 weeks ago, 2 comments)
#173 LLM result parsing bug (bauersimon, closed 2 weeks ago, 3 comments)
#172 fix, Use a backoff for retrying LLM queries because it seems that some LLMs need longer to recover (zimmski, closed 2 weeks ago, 0 comments)