symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Coverage for Java is tracked for lines, while Go is tracked for ranges #193

Open bauersimon opened 2 weeks ago

bauersimon commented 2 weeks ago

Looking at similar implementations in both languages (with tests for which gotestsum and Maven, respectively, report 100% coverage):

go

package light

func validDate(day int, month int, year int) bool {
    monthDays := []int{31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31}

    if year < 1583 {
        return false
    }
    if month < 1 || month > 12 {
        return false
    }
    if day < 1 {
        return false
    }
    if month == 2 {
        if (year%400) != 0 && (year%4) == 0 {
            if day > 29 {
                return false
            }
        } else {
            if day > 28 {
                return false
            }
        }
    } else {
        if day > monthDays[month-1] {
            return false
        }
    }

    return true
}

result of symflower test:

[
  {
    "FileRange": "light/validateDate.go:12:2-light/validateDate.go:14:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:15:2-light/validateDate.go:19:5",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:20:9-light/validateDate.go:23:5",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:25:8-light/validateDate.go:28:4",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:31:2-light/validateDate.go:31:13",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:3:51-light/validateDate.go:8:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "light/validateDate.go:9:2-light/validateDate.go:11:3",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  }
]

coverage objects (all entries with count > 0): 7
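
To make the number above reproducible, here is a minimal sketch of that counting, assuming the report is decoded into a hypothetical `coverageEntry` struct. The type and function names are made up for illustration and are not taken from the eval-dev-quality code base.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// coverageEntry mirrors one object of the JSON report (hypothetical type name).
type coverageEntry struct {
	FileRange    string
	CoverageType string
	Count        int
}

// goReport is the Go report from this comment.
const goReport = `[
  {"FileRange": "light/validateDate.go:12:2-light/validateDate.go:14:3", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:15:2-light/validateDate.go:19:5", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:20:9-light/validateDate.go:23:5", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:25:8-light/validateDate.go:28:4", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:31:2-light/validateDate.go:31:13", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:3:51-light/validateDate.go:8:3", "CoverageType": "NodeCoverageTrue", "Count": 1},
  {"FileRange": "light/validateDate.go:9:2-light/validateDate.go:11:3", "CoverageType": "NodeCoverageTrue", "Count": 1}
]`

// countCoveredEntries returns how many report entries have a count above zero.
func countCoveredEntries(report []byte) (int, error) {
	var entries []coverageEntry
	if err := json.Unmarshal(report, &entries); err != nil {
		return 0, err
	}
	covered := 0
	for _, e := range entries {
		if e.Count > 0 {
			covered++
		}
	}
	return covered, nil
}

func main() {
	n, err := countCoveredEntries([]byte(goReport))
	if err != nil {
		panic(err)
	}
	fmt.Println("coverage objects:", n) // prints "coverage objects: 7"
}
```

Every entry adds exactly one to the result, no matter how many statements its range spans.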

java

package com.eval;

class ValidDate {
    static boolean validDate(int day, int month, int year) {
        int[] monthDays = {31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};

        if (year < 1583) {
            return false;
        }
        if (month < 1 || month > 12) {
            return false;
        }
        if (day < 1) {
            return false;
        }
        if (month == 2) {
            if ((year % 400) != 0 && (year % 4) == 0) {
                if (day > 29) {
                    return false;
                }
            } else {
                if (day > 28) {
                    return false;
                }
            }
        } else {
            if (day > monthDays[month-1]) {
                return false;
            }
        }

        return true;
    }
}

result of symflower test:

[
  {
    "FileRange": "com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 8
  },
  {
    "FileRange": "com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:13:1-com/eval/ValidDate.java:13:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 7
  },
  {
    "FileRange": "com/eval/ValidDate.java:13:1-com/eval/ValidDate.java:13:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 9
  },
  {
    "FileRange": "com/eval/ValidDate.java:14:1-com/eval/ValidDate.java:14:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:14:1-com/eval/ValidDate.java:14:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:16:1-com/eval/ValidDate.java:16:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:16:1-com/eval/ValidDate.java:16:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:17:1-com/eval/ValidDate.java:17:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:17:1-com/eval/ValidDate.java:17:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 7
  },
  {
    "FileRange": "com/eval/ValidDate.java:18:1-com/eval/ValidDate.java:18:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:18:1-com/eval/ValidDate.java:18:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:19:1-com/eval/ValidDate.java:19:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:19:1-com/eval/ValidDate.java:19:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:22:1-com/eval/ValidDate.java:22:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 2
  },
  {
    "FileRange": "com/eval/ValidDate.java:22:1-com/eval/ValidDate.java:22:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 4
  },
  {
    "FileRange": "com/eval/ValidDate.java:23:1-com/eval/ValidDate.java:23:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:23:1-com/eval/ValidDate.java:23:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:27:1-com/eval/ValidDate.java:27:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:27:1-com/eval/ValidDate.java:27:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 3
  },
  {
    "FileRange": "com/eval/ValidDate.java:28:1-com/eval/ValidDate.java:28:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:28:1-com/eval/ValidDate.java:28:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  },
  {
    "FileRange": "com/eval/ValidDate.java:32:1-com/eval/ValidDate.java:32:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:32:1-com/eval/ValidDate.java:32:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 4
  },
  {
    "FileRange": "com/eval/ValidDate.java:4:1-com/eval/ValidDate.java:4:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:4:1-com/eval/ValidDate.java:4:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 11
  },
  {
    "FileRange": "com/eval/ValidDate.java:5:1-com/eval/ValidDate.java:5:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:5:1-com/eval/ValidDate.java:5:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 11
  },
  {
    "FileRange": "com/eval/ValidDate.java:7:1-com/eval/ValidDate.java:7:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 10
  },
  {
    "FileRange": "com/eval/ValidDate.java:7:1-com/eval/ValidDate.java:7:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 12
  },
  {
    "FileRange": "com/eval/ValidDate.java:8:1-com/eval/ValidDate.java:8:99999",
    "CoverageType": "NodeCoverageFalse",
    "Count": 0
  },
  {
    "FileRange": "com/eval/ValidDate.java:8:1-com/eval/ValidDate.java:8:99999",
    "CoverageType": "NodeCoverageTrue",
    "Count": 1
  }
]

coverage objects (all entries with count > 0): 24
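
The Java report has a `NodeCoverageTrue` and a `NodeCoverageFalse` entry for every source line, each spanning the whole line (column 1 to 99999). A rough sketch of how that inflates the "count > 0" number, using only the entries for lines 10 and 11 from above. `coveredEntries` and `coveredRanges` are hypothetical helpers for illustration, not a proposed fix.

```go
package main

import "fmt"

// coverageEntry mirrors one object of the JSON report (hypothetical type name).
type coverageEntry struct {
	FileRange    string
	CoverageType string
	Count        int
}

// sampleJavaEntries returns the report entries for lines 10 and 11 only.
func sampleJavaEntries() []coverageEntry {
	return []coverageEntry{
		{"com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999", "NodeCoverageFalse", 8},
		{"com/eval/ValidDate.java:10:1-com/eval/ValidDate.java:10:99999", "NodeCoverageTrue", 12},
		{"com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999", "NodeCoverageFalse", 0},
		{"com/eval/ValidDate.java:11:1-com/eval/ValidDate.java:11:99999", "NodeCoverageTrue", 2},
	}
}

// coveredEntries counts every entry with Count > 0.
func coveredEntries(entries []coverageEntry) int {
	n := 0
	for _, e := range entries {
		if e.Count > 0 {
			n++
		}
	}
	return n
}

// coveredRanges counts distinct FileRange values with at least one Count > 0.
func coveredRanges(entries []coverageEntry) int {
	seen := map[string]bool{}
	for _, e := range entries {
		if e.Count > 0 {
			seen[e.FileRange] = true
		}
	}
	return len(seen)
}

func main() {
	entries := sampleJavaEntries()
	fmt.Println("entries with count > 0:", coveredEntries(entries)) // 3
	fmt.Println("distinct covered lines:", coveredRanges(entries))  // 2
}
```

Two source lines already yield three "coverage objects" here, which is why the Java number is not comparable to the Go one.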

bauersimon commented 2 weeks ago

Even worse... the Go coverage report of symflower test is also wrong. The actual result of gotestsum is:

mode: set
light/validateDate.go:3.51,6.17 2 1
light/validateDate.go:6.17,8.3 1 1
light/validateDate.go:9.2,9.29 1 1
light/validateDate.go:9.29,11.3 1 1
light/validateDate.go:12.2,12.13 1 1
light/validateDate.go:12.13,14.3 1 1
light/validateDate.go:15.2,15.16 1 1
light/validateDate.go:15.16,16.39 1 1
light/validateDate.go:16.39,17.16 1 1
light/validateDate.go:17.16,19.5 1 1
light/validateDate.go:20.9,21.16 1 1
light/validateDate.go:21.16,23.5 1 1
light/validateDate.go:25.8,26.31 1 1
light/validateDate.go:26.31,28.4 1 1
light/validateDate.go:31.2,31.13 1 1

This means that the block spanning lines 3-6 contains two covered statements.

But in the JSON report of symflower test that same range only has a coverage count of one...

Well... internally we just increment the coverage by one for every entry with count > 0, so that is also wrong then.

bauersimon commented 2 weeks ago

Also, the JSON output of symflower test does not contain all the statements from the actual Go coverage report 🤔

bauersimon commented 2 weeks ago

in theory there are 16 statements in both implementations, so the correct result must be a coverage of 16 for both of them, not 7 and not 24
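
The 16 can be checked against the gotestsum profile from the earlier comment. Each line of a Go cover profile has the form `file:startLine.startCol,endLine.endCol numStatements count`, so summing the statement counts of all blocks with a non-zero count gives the covered statements. A minimal sketch, where `coveredStatements` is a hypothetical helper and not code from the repository:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// goProfile is the cover profile reported by gotestsum.
const goProfile = `mode: set
light/validateDate.go:3.51,6.17 2 1
light/validateDate.go:6.17,8.3 1 1
light/validateDate.go:9.2,9.29 1 1
light/validateDate.go:9.29,11.3 1 1
light/validateDate.go:12.2,12.13 1 1
light/validateDate.go:12.13,14.3 1 1
light/validateDate.go:15.2,15.16 1 1
light/validateDate.go:15.16,16.39 1 1
light/validateDate.go:16.39,17.16 1 1
light/validateDate.go:17.16,19.5 1 1
light/validateDate.go:20.9,21.16 1 1
light/validateDate.go:21.16,23.5 1 1
light/validateDate.go:25.8,26.31 1 1
light/validateDate.go:26.31,28.4 1 1
light/validateDate.go:31.2,31.13 1 1
`

// coveredStatements returns the number of covered blocks and the number of
// covered statements (sum of numStatements over blocks with count > 0).
func coveredStatements(profile string) (int, int, error) {
	blocks, statements := 0, 0
	scanner := bufio.NewScanner(strings.NewReader(profile))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "mode:") {
			continue
		}
		fields := strings.Fields(line) // range, numStatements, count
		if len(fields) != 3 {
			return 0, 0, fmt.Errorf("unexpected profile line %q", line)
		}
		numStatements, err := strconv.Atoi(fields[1])
		if err != nil {
			return 0, 0, err
		}
		count, err := strconv.Atoi(fields[2])
		if err != nil {
			return 0, 0, err
		}
		if count > 0 {
			blocks++
			statements += numStatements
		}
	}
	return blocks, statements, scanner.Err()
}

func main() {
	blocks, statements, err := coveredStatements(goProfile)
	if err != nil {
		panic(err)
	}
	fmt.Println("covered blocks:", blocks)         // 15
	fmt.Println("covered statements:", statements) // 16
}
```

Counting blocks gives 15, while summing statements gives 16. The difference comes from the first block, which holds two statements.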
