symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Deal with failing tests #158

Closed zimmski closed 3 weeks ago

zimmski commented 3 weeks ago
2024/06/04 23:21:41 Querying model "openrouter/meta-llama/llama-3-70b-instruct" with:
        Given the following Go code file "callLoopConditionsOftenEnough.go" with package "light", provide a test file for this code.
        The tests should produce 100 percent code coverage and must compile.
        The response must contain only the test code and nothing else.

        ```golang
        package light

        func callLoopConditionsOftenEnough(x, y int) int {
                if x < 10 || x > 20 {
                        return 0
                }

                for i := 0; i < y; i++ {
                        if i > 20 {
                                x++ // This needs to be executed more than 10 times
                        }
                }

                if x > 20 { // This block needs to be reached for full coverage
                        x = x / 2
                }

                return x
        }
        ```

2024/06/04 23:21:48 Model "openrouter/meta-llama/llama-3-70b-instruct" responded (6865 ms) with:

        package light

        import "testing"

        func TestCallLoopConditionsOftenEnough(t *testing.T) {
                type args struct {
                        x int
                        y int
                }
                tests := []struct {
                        name string
                        args
                        want int
                }{
                        {"x less than 10", args{-1, 10}, 0},
                        {"x greater than 20", args{30, 10}, 0},
                        {"i never greater than 20", args{10, 20}, 10},
                        {"i greater than 20, x incremented more than 10 times", args{10, 30}, 15},
                        {"i greater than 20, x incremented exactly 10 times", args{10, 25}, 15},
                }
                for _, tt := range tests {
                        if got := callLoopConditionsOftenEnough(tt.args.x, tt.args.y); got != tt.want {
                                t.Errorf("%q. callLoopConditionsOftenEnough(%v, %v) = %v, want %v", tt.name, tt.args.x, tt.args.y, got, tt.want)
                        }
                }
        }

2024/06/04 23:21:48 $ go mod tidy
2024/06/04 23:21:48 $ symflower test --language golang --workspace /tmp/eval-dev-quality87130698/light --coverage-file /tmp/eval-dev-quality87130698/light/coverage.json
cannot check for updates: cannot unmarshal time from lastUpdateChecked file /home/zimmski/.config/symflower/lastUpdateChecked: parsing time "2024-05-13T14:56:10.52553777+02:000": extra text: "0"
=== RUN TestCallLoopConditionsOftenEnough
    callLoopConditionsOftenEnough_test.go:23: "i greater than 20, x incremented more than 10 times". callLoopConditionsOftenEnough(10, 30) = 19, want 15
    callLoopConditionsOftenEnough_test.go:23: "i greater than 20, x incremented exactly 10 times". callLoopConditionsOftenEnough(10, 25) = 14, want 15
--- FAIL: TestCallLoopConditionsOftenEnough (0.00s)
FAIL
coverage: 3.3% of statements
FAIL light 0.002s

=== Failed
=== FAIL: . TestCallLoopConditionsOftenEnough (0.00s)
    callLoopConditionsOftenEnough_test.go:23: "i greater than 20, x incremented more than 10 times". callLoopConditionsOftenEnough(10, 30) = 19, want 15
    callLoopConditionsOftenEnough_test.go:23: "i greater than 20, x incremented exactly 10 times". callLoopConditionsOftenEnough(10, 25) = 14, want 15

DONE 1 tests, 1 failure in 0.183s



Tasks:
- [x] Allow failing tests to have coverage points
- [x] Include a short write-up with examples for the deep dive blog post
- [x] For a future version, add a task for a penalty on generating failing tests. This is a hard nut to crack.