openai / human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"
MIT License

Evaluation doesn't work on Windows #45

Open · peter-ch opened this issue 3 months ago

peter-ch commented 3 months ago

After getting a score of 0 every time, I looked at the samples.jsonl_results.jsonl file, and the result for every task is the same: "failed: module 'signal' has no attribute 'setitimer'"

This seems like a Windows/Unix issue.
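
For reference, the harness's time_limit context manager relies on signal.setitimer and signal.SIGALRM, both documented as Unix-only, so the attribute simply doesn't exist on Windows. A quick check (a hypothetical snippet, not from the repo) illustrates this:

import signal
import sys

print(sys.platform)                  # 'win32' on Windows
print(hasattr(signal, "setitimer"))  # False on Windows, True on Linux/macOS
print(hasattr(signal, "SIGALRM"))    # False on Windows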

Ephrem-Adugna commented 2 months ago

Same issue here

mfwong1223 commented 2 months ago

For Windows, I replaced the signal module with the threading module at https://github.com/openai/human-eval/blob/312c5e5532f0e0470bf47f77a6243e02a61da530/human_eval/execution.py#L90-L99 with the following:

import threading

# Drop-in replacement for time_limit in human_eval/execution.py; contextlib and
# TimeoutException are already defined/imported in that file.
@contextlib.contextmanager
def time_limit(seconds: float):
    def signal_handler():
        raise TimeoutException("Timed out!")
    timer = threading.Timer(seconds, signal_handler)
    timer.start()
    try:
        yield
    finally:
        timer.cancel()
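
One caveat with the Timer-based version: the exception is raised inside the timer thread, not in the code running under the with block, so a completion that truly hangs may never be interrupted. If that matters, a possible alternative (a sketch only, assuming it likewise replaces time_limit in execution.py, where contextlib and TimeoutException are already available) is to interrupt the main thread instead:

import _thread
import threading

@contextlib.contextmanager
def time_limit(seconds: float):
    # interrupt_main raises KeyboardInterrupt in the main thread when the timer fires
    timer = threading.Timer(seconds, _thread.interrupt_main)
    timer.start()
    try:
        yield
    except KeyboardInterrupt:
        raise TimeoutException("Timed out!")
    finally:
        timer.cancel()
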
Ephrem-Adugna commented 2 months ago

The above didn't work for me; I just ran it inside a Linux VM using WSL instead.

CynicalWilson commented 1 week ago

Same issue here. For every LLM I load in LM Studio and test against HumanEval via the scripts below, I get a score of 0, with every task failing with the same signal attribute error.

HumanEval.py:

import json
from human_eval.data import write_jsonl, read_problems
from human_eval.evaluation import evaluate_functional_correctness
from local_llm_client import client

def generate_one_completion(prompt):
    # Send the HumanEval prompt to the local model and return the raw reply text.
    messages = [{"role": "user", "content": prompt}]
    response = client.chat_completion_create(messages)
    return response['choices'][0]['message']['content']

def generate_completions(problems, output_file):
    # Generate one completion per task and write them out as JSONL samples.
    samples = []
    for task_id, problem in problems.items():
        prompt = problem["prompt"]
        completion = generate_one_completion(prompt)
        samples.append({"task_id": task_id, "completion": completion})

    write_jsonl(output_file, samples)

if __name__ == "__main__":
    problems = read_problems()
    output_file = "completions.jsonl"

    generate_completions(problems, output_file)

    # Runs the official harness; on Windows this is where execution.py's
    # signal-based time_limit fails unless it has been patched as above.
    results = evaluate_functional_correctness(output_file)
    print(json.dumps(results, indent=2))

local_llm_client.py:

import requests

class LocalLLMClient:
    # Minimal client for a local OpenAI-compatible chat completions endpoint (e.g. LM Studio).
    def __init__(self, base_url="http://localhost:4445"):
        self.base_url = base_url

    def chat_completion_create(self, messages, temperature=0.7, max_tokens=-1, stream=False):
        url = f"{self.base_url}/v1/chat/completions"
        headers = {"Content-Type": "application/json"}
        data = {
            "model": "nxcode-cq-7b-orpo-q8_0",  # Adjust this to match your model name
            "messages": messages,
            "temperature": temperature,
            "max_tokens": max_tokens,  # -1 = no explicit cap (LM Studio convention)
            "stream": stream
        }

        response = requests.post(url, headers=headers, json=data)
        response.raise_for_status()
        return response.json()

client = LocalLLMClient()
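
Separate from the Windows signal error: evaluate_functional_correctness scores the concatenation of problem["prompt"] and the completion, while a chat endpoint like this typically returns a full markdown answer (fenced code, the repeated def line, prose). If the score stays at 0 even after patching execution.py, the reply may need to be reduced to just the function body first; a rough, hypothetical sketch (extract_completion is not part of human-eval):

def extract_completion(raw: str, prompt: str) -> str:
    # Best-effort cleanup of a chat reply into a HumanEval-style completion.
    text = raw
    # Prefer the contents of the first fenced code block, if there is one.
    if "```" in text:
        parts = text.split("```")
        if len(parts) >= 2:
            text = parts[1]
            if text.startswith("python"):
                text = text[len("python"):]
    # If the model echoed the prompt (signature + docstring), keep only what follows it.
    if prompt.strip() and prompt in text:
        text = text.split(prompt, 1)[1]
    return text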