nuprl / MultiPL-E

A multi-programming language benchmark for LLMs
https://nuprl.github.io/MultiPL-E/

Dev - add Ziglang #94

Closed. jelber2 closed this 10 months ago

jelber2 commented 1 year ago

Was able to use dataset_builder/all_prepare_prompts.py to make prompts for Ziglang. Tests will probably fail, as I still need to figure out the proper Python-to-Zig test conversion in dataset_builder/humaneval_to_zig.py and the correct format for dataset_builder/terms.csv. I had to convert the JSONL output to JSON for automodel.py to work, but it seems that might be taken care of in commit 0adb7a42f95996e4000c31dfb5e48cd4ac571762.
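If the JSONL-to-JSON workaround is still needed, a jq one-liner is probably enough; this is just a sketch, and the filename is a placeholder for whatever all_prepare_prompts.py actually emits:

# Slurp the JSONL lines into a single JSON array (filename is a placeholder)
jq -s '.' humaneval-zig-prompts.jsonl > humaneval-zig-prompts.json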

jelber2 commented 1 year ago

Need to figure out the easiest way to install Zig and what version to use for the container.

JohnGouwar commented 1 year ago

Here is a shell script that installs Zig from a prebuilt tarball; it could certainly be adapted to run in the container:

#!/bin/bash
# Download a prebuilt Zig nightly tarball and unpack it into /zig
ZIG_TARBALL="zig-linux-x86_64-0.12.0-dev.167+dd6a9caea.tar.xz"
ZIG_DIR="/zig"
wget https://ziglang.org/builds/$ZIG_TARBALL && \
  mkdir -p $ZIG_DIR && \
  tar xf $ZIG_TARBALL -C $ZIG_DIR --strip-components 1

You would then need to either install the /zig/zig executable somewhere in the PATH, or update PATH to point to /zig.
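For example (a minimal sketch, either in the image's Dockerfile or an entrypoint script):

# Option 1: prepend /zig to PATH
export PATH="/zig:${PATH}"
# Option 2: symlink the executable into a directory already on PATH
ln -s /zig/zig /usr/local/bin/zig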

Where I found the tarball: https://ziglang.org/download/

jelber2 commented 1 year ago

Thanks @JohnGouwar. I had been using https://github.com/tristanisham/zvm, which is convenient for pinning specific versions, but the approach you cite is fine; you just need to point to whatever version you want. One big problem with Zig is that it is not stable yet, and there have been many breaking changes, which may be more or less of an issue depending on the LLM's training data cutoff.
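For reference, pinning a version with zvm looks roughly like this (assumed zvm subcommands, not verified here):

# Install and activate a specific Zig version via zvm (assumed usage)
zvm install 0.11.0
zvm use 0.11.0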

jelber2 commented 1 year ago

@andrewrk what version of Zig would you recommend writing HumanEval for, given that the CodeLlama and StarCoder training data likely cut off around Zig 0.8.0 to 0.9.0?

JohnGouwar commented 1 year ago

@jelber2 Glad to help. My intuition with the tarball approach is that it's probably easier to do non-interactively in a container (though I don't have any personal experience with zvm so that intuition may be inaccurate). I'm not familiar with Zig, but what you could try is to generate ~20 completions with starcoderbase-1b on translated Zig HumanEval prompts, run evaluation on multiple versions and see if you get different results (i.e. the same program fails in one version, but not another) to see which version seems best to use in the container. I imagine that the language features tested by HumanEval and MBPP should be relatively stable.
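A rough sketch of that comparison loop, assuming the completions have already been generated once; run_zig_eval.sh is a stand-in for however the Zig evaluation ends up being invoked, not an actual MultiPL-E command, and the release tarball URL pattern is taken from ziglang.org/download:

#!/bin/bash
# Compare the same Zig completions across several Zig releases
COMPLETIONS_DIR="zig-completions"   # completions from starcoderbase-1b, generated once
for ZIG_VERSION in 0.9.1 0.10.1 0.11.0; do
  TARBALL="zig-linux-x86_64-${ZIG_VERSION}.tar.xz"
  wget -q "https://ziglang.org/download/${ZIG_VERSION}/${TARBALL}"
  mkdir -p "/zig-${ZIG_VERSION}"
  tar xf "${TARBALL}" -C "/zig-${ZIG_VERSION}" --strip-components 1
  # run_zig_eval.sh is a placeholder for the real evaluation step
  PATH="/zig-${ZIG_VERSION}:${PATH}" ./run_zig_eval.sh "${COMPLETIONS_DIR}" \
    > "results-zig-${ZIG_VERSION}.log"
done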

JohnGouwar commented 1 year ago

Had to make a small merge for ongoing work; just fixed some small import conflicts.

jelber2 commented 1 year ago

> @jelber2 Glad to help. My intuition with the tarball approach is that it's probably easier to do non-interactively in a container (though I don't have any personal experience with zvm so that intuition may be inaccurate). I'm not familiar with Zig, but what you could try is to generate ~20 completions with starcoderbase-1b on translated Zig HumanEval prompts, run evaluation on multiple versions and see if you get different results (i.e. the same program fails in one version, but not another) to see which version seems best to use in the container. I imagine that the language features tested by HumanEval and MBPP should be relatively stable.

Ok, I'll give this a try when I have a block of time.

jelber2 commented 1 year ago

Need to do some extensive work on humaneval_to_zig.py. I generated completions with starcoderbase-3b, and the tests failed in the evaluation container. I had just modified the code from humaneval_to_cpp.py for Zig's types without really looking at the JSON outputs until now. Getting humaneval_to_zig.py to work will take some time.
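One quick sanity check before rerunning the model is to eyeball the translated prompt and tests for a single problem; the field names and filename here are assumptions about the prepared-prompt JSON layout, not the confirmed schema:

# Print the Zig prompt and translated tests for the first problem (field names assumed)
jq -r '.[0].prompt, .[0].tests' humaneval-zig-prompts.json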

jelber2 commented 10 months ago

I have pretty much given up on this, as I do not have the time to dig into the Python code for generating the prompts.