symflower / eval-dev-quality

DevQualityEval: An evaluation benchmark 📈 and framework to compare and evolve the quality of code generation of LLMs.
https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/
MIT License

Evaluation task: Transpile #201

Status: Open. ruiAzevedo19 opened this issue 1 week ago.

ruiAzevedo19 commented 1 week ago

Goal

Given a source file, the model needs to transpile it from Java to Go, and from Go to Java. The response is validated by executing predefined tests that make sure the transpiled implementation is correct.
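The validation step could be sketched as follows in Go. This is a hypothetical helper (name and flow are assumptions, not taken from the repository): for a Java→Go case it would run `go test` in the package directory of the transpiled code, and the Go→Java direction would invoke the Java build tool instead.

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateTranspilation runs the predefined tests of a package to check
// that the transpiled implementation is correct. A minimal sketch for the
// Java→Go direction; the helper name is hypothetical.
func validateTranspilation(packagePath string) error {
	cmd := exec.Command("go", "test", "./...")
	cmd.Dir = packagePath
	output, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("tests failed: %w\n%s", err, output)
	}
	return nil
}

func main() {
	// Demonstrate the error path without requiring a real test suite on disk:
	// a missing package directory makes the command fail to start.
	err := validateTranspilation("no-such-package")
	fmt.Println("validation failed:", err != nil)
}
```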

TODOs

bauersimon commented 1 week ago

So in this case a test case contains the implementation in language A and a test file in language B, right?

We need to ensure that the transpiled implementation in language B actually works with the tests, so we need to show the model the signature in language B that it needs to conform to. I would therefore also have, for each example, an implementation file in language B that already contains the implementation's signature, and show that file in the prompt.
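One hypothetical shape for such a file, sketched in Go (the package, function name, and panic convention are illustrative assumptions, not taken from the repository): the stub pins down the exact signature the existing tests in language B compile against, and the model only has to supply the body.

```go
package main

import "fmt"

// Hypothetical stub for a Java→Go transpile case. The benchmark would show
// this file in the prompt so the model knows the signature the existing Go
// tests compile against; the model replaces the panicking body with the
// transpiled implementation.

// hasBalancedBrackets reports whether every opening bracket in s has a
// matching closing bracket. (Name is illustrative.)
func hasBalancedBrackets(s string) bool {
	panic("not transpiled yet") // the model fills in this body
}

func main() {
	// Calling the stub before transpilation panics, which a test run
	// would surface as a failing task.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("stub called before transpilation:", r)
		}
	}()
	hasBalancedBrackets("()")
}
```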

bauersimon commented 1 week ago

I think the refactoring of the prompting is a bit too ambitious. The only thing that really changes for each task is the prompt template and the context (and we even embed parts of the context, like the source file, to deduplicate code).

It would be nice to have a helper function that just takes both of these and applies the context to the template. I think we can get away with an `any` argument for the context, because it is just a context and there is no common method for it. The templating will fail anyway if it cannot find the context values it needs, so that check already happens; it is not necessary to add another one. I am not sure we need to introduce an interface at all.
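A minimal sketch of such a helper, assuming Go's `text/template` (the helper name and the context struct are hypothetical): the context is a plain `any`, and execution fails on its own if the template references a value the context does not provide, so no extra validation layer is needed.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// applyTemplate renders a prompt template with an arbitrary context value.
// Hypothetical helper: the context is `any` because there is no common
// method all contexts share; `text/template` already errors out when a
// referenced field is missing.
func applyTemplate(name, tmpl string, context any) (string, error) {
	t, err := template.New(name).Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, context); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Hypothetical context for a transpile task.
	type transpileContext struct {
		OriginLanguage string
		TargetLanguage string
		Code           string
	}

	prompt, err := applyTemplate("transpile",
		"Transpile the following {{.OriginLanguage}} code to {{.TargetLanguage}}:\n{{.Code}}",
		transpileContext{
			OriginLanguage: "Java",
			TargetLanguage: "Go",
			Code:           "int add(int a, int b) { return a + b; }",
		},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(prompt)

	// Referencing a value the context does not have fails at execution
	// time, so the "missing context value" check comes for free.
	_, err = applyTemplate("bad", "{{.DoesNotExist}}", transpileContext{})
	fmt.Println(err != nil) // prints "true"
}
```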