nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Add support for the Dart language #152

Closed. devoncarew closed this 1 month ago

devoncarew commented 3 months ago

Hi, I'm curious what the process is for adding support for a new language to the translator and benchmarks (specifically, support for Dart / dart.dev). I did find https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/README.md, but was wondering whether there is more detailed information and/or a few good representative PRs.

arjunguha commented 3 months ago

Thanks for your interest. We'd love to include Dart. Based on what I know of Dart, I suggest adapting the TypeScript translator:

https://github.com/nuprl/MultiPL-E/blob/main/dataset_builder/humaneval_to_ts.py

Once you've written humaneval_to_dart.py, the simplest way to spot-check it is with test.py in the same directory. This example runs the TypeScript translator on the simplest problem; substitute humaneval_to_dart to check yours:

test.py humaneval_to_ts ../datasets/originals/HumanEval_53_add.py

You'll get three outputs, separated by ****: (1) the Dart prompt, (2) the Dart test suite, and (3) the list of stop tokens.
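If you capture that output to look at each part separately, a small helper along these lines may be handy. It is only a sketch: `split_translator_output` is a hypothetical name, and it assumes each separator sits on its own line as a run of four or more asterisks, which may not match the exact output of test.py.

```python
import re


def split_translator_output(text: str) -> list[str]:
    """Split captured translator output into its non-empty sections."""
    # Assumption: a separator is a line containing only 4+ asterisks.
    parts = re.split(r"(?m)^\*{4,}[ \t]*$", text)
    return [part.strip() for part in parts if part.strip()]


# Made-up sample standing in for real test.py output.
sample = "PROMPT TEXT\n****\nTEST SUITE\n****\n[\"\\n}\"]\n"
prompt, tests, stop_tokens = split_translator_output(sample)
print(prompt)
print(tests)
print(stop_tokens)
```

Unpacking into exactly three names doubles as a sanity check: it raises a ValueError if the translator printed more or fewer sections than expected.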

Before trying to benchmark, it is best to spot-check on a diverse set of problems that exercise different parts of the translator. These are the HumanEval problems that we use to spot-check:

The final part is writing a little execution script. I would adapt the TypeScript execution script as well:

https://github.com/nuprl/MultiPL-E/blob/main/evaluation/src/eval_ts.py
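As a rough idea of what such a script does, here is a minimal sketch using only the Python standard library. It is not the code in eval_ts.py: the function names, the result keys, and the status strings are assumptions made for illustration, as is the premise that `dart run <file>` exits with a non-zero code when a generated test fails.

```python
import subprocess
import sys
from pathlib import Path


def run_program(cmd: list[str], timeout_seconds: int = 15) -> dict:
    """Run a command and classify the outcome by exit code or timeout."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        return {"status": "Timeout", "exit_code": -1, "stdout": "", "stderr": ""}
    # Hypothetical status labels: zero exit code means the tests passed.
    status = "OK" if proc.returncode == 0 else "Exception"
    return {
        "status": status,
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


def eval_script(path: Path) -> dict:
    """Run one generated Dart program (prompt + completion + tests)."""
    # Assumption: the Dart SDK is on PATH and a failing test makes the
    # program exit with a non-zero code.
    return run_program(["dart", "run", str(path)])


# Demonstration with the Python interpreter, so no Dart SDK is needed here.
demo = run_program([sys.executable, "-c", "print('ok')"])
print(demo["status"])
```

The timeout matters because generated completions can loop forever; a real script would probably also want to distinguish compile errors from runtime failures.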

devoncarew commented 3 months ago

Thanks for the response! I'll look into the above and ping if I hit issues or make progress.

arjunguha commented 1 month ago

Some results that seem reasonable to me:

Dataset,Pass@k,Estimate,NumProblems,MinCompletions,MaxCompletions
dart_prompts-deepseekcoder_v2lite_base-0.2-reworded,1,0.25,43,50,50
dart_prompts-starcoder2_15b-0.2-reworded,1,0.35,157,50,50
dart_prompts-starcoderbase-0.2-reworded,1,0.18,157,50,50
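For readers unfamiliar with the Estimate column, the sketch below shows the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), averaged over problems. That these numbers were computed this way is an assumption, and the function names are illustrative rather than taken from the repository. With 50 completions per problem and k = 1, the per-problem value reduces to the fraction of completions that pass.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem with n completions, c of them correct."""
    if n - c < k:
        # Every sample of k completions must contain a correct one.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def benchmark_estimate(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates across the benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Made-up counts for three problems with 50 completions each.
print(benchmark_estimate([0, 25, 50], n=50, k=1))
```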

I've also posted them here:

https://huggingface.co/spaces/nuprl/MultiPL-E

Next up is a release, including adding to https://github.com/bigcode-project/bigcode-evaluation-harness/

I'll get to that within a week or so.

Thanks for the PR!