infi-coder / inficoder-eval-framework

The evaluation framework for the InfiCoder-Eval benchmark.
Apache License 2.0

Discussion on InfiCoder-Eval Evaluation Framework #1

Open llylly opened 5 months ago

llylly commented 5 months ago

Please feel free to discuss in this thread anything about the InfiCoder-Eval evaluation framework. We welcome any feedback and comments!

xieqk commented 5 months ago

Could you provide the Docker image for evaluation?

superkido511 commented 4 months ago

Could you provide more information about the blank filling and keyword matching metrics? How are they calculated?

llylly commented 4 months ago

Could you provide more information about the blank filling and keyword matching metrics? How are they calculated?

Thanks for your interest in our benchmark! Generally, for blank filling and keyword matching, domain experts provide a customized list of target blank answers / target keywords for each question.

Note that each problem has multiple blanks / keywords. By default, the problem score (1.0 point) is allocated evenly across the blanks or keywords, but customized weighting may exist. You can find the concrete evaluation criteria in the eval_*.yaml files at https://github.com/infi-coder/inficoder-eval-framework/tree/main/cases_dev. Feel free to let me know if you have further questions or comments!
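To illustrate the default even weighting, here is a minimal hypothetical Python sketch. The function name keyword_score, the plain substring check, and the example keywords are illustrative assumptions only, not the framework's actual grading code; the real matching rules and weights are defined per question in the eval_*.yaml files.

# Hypothetical sketch of the default even weighting; not the framework's actual grader.
def keyword_score(response: str, keywords: list[str]) -> float:
    """Each target keyword carries an equal share of the 1.0-point problem score."""
    if not keywords:
        return 0.0
    weight = 1.0 / len(keywords)
    return sum(weight for kw in keywords if kw in response)

# Example: 2 of 3 target keywords appear in the response, so the problem scores ~0.67.
print(keyword_score("use np.argsort and then take along the axis", ["np.argsort", "take", "axis=1"]))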

llylly commented 4 months ago

Could you provide the Docker image for evaluation?

Thanks for your interest in our evaluation framework! We are preparing to release a public Docker image. Until it is released, you can use the following Dockerfile (the configuration we use internally) to reproduce our evaluation results:

# Any other Ubuntu 20.04 base image should also work; PyTorch/GPU support is only needed if you run inference on the same instance
FROM hunterddm/nvidia_pytorch

# Set the timezone
ARG TZ=Etc/UTC
RUN ln -snf /usr/share/zoneinfo/$TZ /etc/localtime && echo $TZ > /etc/timezone

# Update the package index and install basic tools (curl is needed later for the nvm install script)
RUN apt-get update && \
    apt-get install -y \
        wget curl

# Install lsof
RUN apt-get update && apt-get install -y lsof

# Install mono for C# environment
RUN apt-get update && \
    apt-get install -y mono-complete

# Install Go
RUN apt-get install -y golang-go

# Install R and dependencies
# Update the package index and install necessary packages
RUN apt-get install -y software-properties-common && \
    apt-get update && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 51716619E084DAB9 && \
    add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" && \
    apt-get update
RUN apt-get upgrade -y
RUN apt install -t focal-cran40 -y r-base r-base-dev
RUN apt-get install -y libssl-dev libfontconfig1-dev \
    libcurl4-openssl-dev libxml2-dev libharfbuzz-dev libfribidi-dev \
    libfreetype6-dev libpng-dev libtiff5-dev libjpeg-dev

# Install R packages
RUN Rscript -e 'install.packages(c("assert", "stringr", "tidyverse", "dplyr", "data.table"), repos="https://cloud.r-project.org/")'

# install and link node.js
RUN apt-get update && apt-get install -y npm
# use nvm to install node.js to get higher version
RUN curl https://raw.githubusercontent.com/creationix/nvm/master/install.sh | bash
ENV NVM_DIR="/root/.nvm"
RUN [ -s "/root/.nvm/nvm.sh" ] && \. "/root/.nvm/nvm.sh" && nvm install v16.15.1 && which node
# nvm installs under $NVM_DIR (/root/.nvm); force-link the node binaries so they take precedence over any apt-installed ones
RUN ln -sf /root/.nvm/versions/node/v16.15.1/bin/node /usr/bin/node && \
    ln -sf /root/.nvm/versions/node/v16.15.1/bin/npm /usr/bin/npm && \
    ln -sf /root/.nvm/versions/node/v16.15.1/bin/npx /usr/bin/npx

# Set up NPM global directory and add it to PATH
RUN mkdir ~/.npm-global && \
    npm config set prefix '~/.npm-global' && \
    echo 'export PATH=~/.npm-global/bin:$PATH' >> ~/.profile && \
    . ~/.profile

# Install global NPM packages
RUN npm install -g jsdom@17.0.0 typescript && \
    export NODE_PATH=$(npm root --quiet -g)

# Install Java and C++ environments
RUN apt-get install -y libboost-all-dev default-jdk python3-setuptools

Then, before running the main module, execute the following commands in the same shell:

sudo chmod -R 777 /root/
export PATH=/root/.npm-global/bin:$PATH
npm config set prefix '/root/.npm-global'
export NODE_PATH="/root/.npm-global/lib/node_modules"
pip install -U pip setuptools
pip install -r requirements.txt

We are verifying that the above configuration and commands work in public environments. If you have any feedback or suggestions, or would like to contribute, please feel free to let us know.

gblazex commented 4 months ago

Any plans to test these models?

Magicoder-S-DS-6.7B
Phind-CodeLlama-34B-v2
DeepSeek-Coder-1.3B-instruct

They rank very high on the EvalPlus leaderboard: https://evalplus.github.io/leaderboard.html

superkido511 commented 4 months ago

@llylly Does this benchmark really include C coding problems? The README states that "Featuring the execution runtime for 8 languages (Python, Javascript, Java, C, C++, Go, R, C#), given model responses, the framework can directly evaluate and output the scores along with subscores in a nice table." However, the data from MultiPL-E doesn't contain C problems.

ajinkya123-robo commented 2 months ago

Hello, this is a great initiative. Where can I submit my models for evaluation? Thanks.