symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License
128 stars 5 forks source link

Improve maintainability of assessments #169

Closed ahumenberger closed 3 months ago

ahumenberger commented 3 months ago

As of now assessments are stored in a nested map type AssessmentPerModelPerLanguagePerRepository map[model.Model]map[language.Language]map[string]metrics.Assessments, and this type is used as is for storing and retrieving assessments. This is super hard to maintain and extend. E.g for #165 we need to store assessments per task, extending the above nested map is super tedious since basically every test asserts such a map, and every single one of them has to be changed.

I'm proposing to make the usage independent of how the assessments are stored. My proposal includes a type

type AssessmentTuple struct {
  Model  model.Model
  Language language.Language
  Repository string
}

and a storage type

type AssessmentStore struct {
  ...
}

func (as *AssessmentStore) Store(tuple *AssessmentTuple) {
  ...
}

This makes it possible to specify a list of expected assessment tuples in a test without the need of dealing with AssessmentPerModelPerLanguagePerRepository. And this highly improves maintainability.

Related tasks:

bauersimon commented 3 months ago

So the tuple holds the information with which model/language/repository (and in the future: task) an assessment is associated? But where is the actual assessment stored?

ahumenberger commented 3 months ago

So the tuple holds the information with which model/language/repository (and in the future: task) an assessment is associated? But where is the actual assessment stored?

Well the whole point is that this does not matter, AssessmentStore should take care of that. And for the time being, AssessmentStore will just use a nested map to store stuff. But callers should not care at all how it is stored.

I'm currently working on a version for thsi where AssessmentTuple is just used in testing, to make tests more maintainable. And AssessmentStore looks like this:

type AssessmentStore struct {
    store map[model.Model]map[language.Language]map[string]metrics.Assessments
}