microsoft / PyCodeGPT

A pre-trained GPT model for Python code completion and generation
MIT License

pass@1 = 1.0 for HumanEval, pass@1 = 0.0 for TorchDataEval #22

Closed. lifelongeeek closed this issue 9 months ago

lifelongeeek commented 9 months ago

I am trying to validate the apicoder evaluation.

I built a "perfect" evaluation file by setting the "completion" field of each entry to the "canonical_solution" from the problem file. However, every example in TorchDataEval failed with "result": "failed: 'NoneType' object is not callable", while HumanEval passes all examples. Any suggestions for solving this issue?
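The construction described above can be sketched roughly as follows. This is a minimal illustration, not the repository's own tooling: `build_perfect_samples` and its `pick` parameter are hypothetical names, and it assumes only the field layout visible in the sample files (a string `canonical_solution` for HumanEval, a list of alternatives for TorchDataEval).

```python
import json


def build_perfect_samples(problem_lines, pick=0):
    """Turn problem-file JSONL lines into evaluation-file JSONL lines.

    `pick` (hypothetical parameter) selects which alternative to use when
    canonical_solution is a list, as in the TorchDataEval samples.
    """
    samples = []
    for line in problem_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        problem = json.loads(line)
        solution = problem["canonical_solution"]
        # HumanEval stores one string; TorchDataEval stores several alternatives.
        if isinstance(solution, list):
            solution = solution[pick]
        samples.append(
            json.dumps({"task_id": problem["task_id"], "completion": solution})
        )
    return samples
```

Each returned string is one line of the evaluation file, so writing them out joined by newlines gives a file in the same shape as the evaluation samples in this post.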

For reference, I have attached two examples each from the problem and evaluation files of the HumanEval and TorchDataEval datasets.

HumanEval

Problem file

{"task_id": "HumanEval/0", "prompt": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given list of numbers, are any two numbers closer to each other than\n    given threshold.\n    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)\n    False\n    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)\n    True\n    \"\"\"\n", "entry_point": "has_close_elements", "canonical_solution": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n", "test": "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.3) == True\n    assert candidate([1.0, 2.0, 3.9, 4.0, 5.0, 2.2], 0.05) == False\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.95) == True\n    assert candidate([1.0, 2.0, 5.9, 4.0, 5.0], 0.8) == False\n    assert candidate([1.0, 2.0, 3.0, 4.0, 5.0, 2.0], 0.1) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 1.0) == True\n    assert candidate([1.1, 2.2, 3.1, 4.1, 5.1], 0.5) == False\n\n"}
{"task_id": "HumanEval/1", "prompt": "from typing import List\n\n\ndef separate_paren_groups(paren_string: str) -> List[str]:\n    \"\"\" Input to this function is a string containing multiple groups of nested parentheses. Your goal is to\n    separate those group into separate strings and return the list of those.\n    Separate groups are balanced (each open brace is properly closed) and not nested within each other\n    Ignore any spaces in the input string.\n    >>> separate_paren_groups('( ) (( )) (( )( ))')\n    ['()', '(())', '(()())']\n    \"\"\"\n", "entry_point": "separate_paren_groups", "canonical_solution": "    result = []\n    current_string = []\n    current_depth = 0\n\n    for c in paren_string:\n        if c == '(':\n            current_depth += 1\n            current_string.append(c)\n        elif c == ')':\n            current_depth -= 1\n            current_string.append(c)\n\n            if current_depth == 0:\n                result.append(''.join(current_string))\n                current_string.clear()\n\n    return result\n", "test": "\n\nMETADATA = {\n    'author': 'jt',\n    'dataset': 'test'\n}\n\n\ndef check(candidate):\n    assert candidate('(()()) ((())) () ((())()())') == [\n        '(()())', '((()))', '()', '((())()())'\n    ]\n    assert candidate('() (()) ((())) (((())))') == [\n        '()', '(())', '((()))', '(((())))'\n    ]\n    assert candidate('(()(())((())))') == [\n        '(()(())((())))'\n    ]\n    assert candidate('( ) (( )) (( )( ))') == ['()', '(())', '(()())']\n"}
...

Evaluation file

{"task_id": "HumanEval/0", "completion": "    for idx, elem in enumerate(numbers):\n        for idx2, elem2 in enumerate(numbers):\n            if idx != idx2:\n                distance = abs(elem - elem2)\n                if distance < threshold:\n                    return True\n\n    return False\n"}
{"task_id": "HumanEval/1", "completion": "    result = []\n    current_string = []\n    current_depth = 0\n\n    for c in paren_string:\n        if c == '(':\n            current_depth += 1\n            current_string.append(c)\n        elif c == ')':\n            current_depth -= 1\n            current_string.append(c)\n\n            if current_depth == 0:\n                result.append(''.join(current_string))\n                current_string.clear()\n\n    return result\n"}
...

TorchDataEval

Problem file

{"task_id": "TorchDataEval/0", "prompt": "from torchdata.datapipes.iter import IterableWrapper\ndatapipe = IterableWrapper([1,2,3])\n# How to augument the datapipe by repeating it six times.\nnew_datapipe =", "entry_point": "none", "canonical_solution": [" Cycler(datapipe, 6)", " datapipe.cycle(6)"], "test": "\n\nMETADATA = {\n    'author': 'msra-v-dazan',\n    'dataset': 'test',\n    'type': 'Cycler'\n}\n\n\ndef check():\n    assert list(new_datapipe) == [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]\n\n"}
{"task_id": "TorchDataEval/1", "prompt": "from torchdata.datapipes.iter import IterableWrapper\n\ndp = IterableWrapper(['a', 'b', 'c'])\n# Assign indexs to the datepipe object.\nnew_dp =", "entry_point": "none", "canonical_solution": [" dp.enumerate()", " Enumerator(dp)"], "test": "\n\nMETADATA = {\n    'author': 'msra-v-dazan',\n    'dataset': 'test',\n    'type': 'Enumerator'\n}\n\n\ndef check():\n    assert list(new_dp) == [(0, 'a'), (1, 'b'), (2, 'c')]\n\n"}
...

Evaluation file

{"task_id": "TorchDataEval/0", "completion": " datapipe.cycle(6)"}
{"task_id": "TorchDataEval/1", "completion": " dp.enumerate()"}
...

NL2Code commented 9 months ago

First, check whether your installed version of torchdata is correct. You can also manually run the prompt together with each canonical_solution for every problem to see whether it executes successfully, and debug from there.
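A manual run of that kind might look like the sketch below. It is an assumption about how one could reproduce the check by hand, not the evaluation harness from this repository: `run_problem` is a hypothetical helper, and the toy record stands in for a real TorchDataEval problem so that the example runs without torchdata installed.

```python
def run_problem(prompt, solution, test, call="check()"):
    """Execute prompt + solution + test in a fresh namespace.

    Returns "passed" or "failed: <message>". Note that exec runs arbitrary
    code, so only use this on problem files you trust.
    """
    program = prompt + solution + "\n" + test + "\n" + call + "\n"
    namespace = {}
    try:
        exec(program, namespace)
    except Exception as exc:  # report any error in the "failed: ..." style
        return f"failed: {exc}"
    return "passed"


# Toy record shaped like the TorchDataEval samples (contents are made up).
toy = {
    "prompt": "values = [1, 2, 3]\nnew_values =",
    "canonical_solution": [" values * 2", " undefined_helper(values)"],
    "test": "\ndef check():\n    assert list(new_values) == [1, 2, 3, 1, 2, 3]\n",
}
results = [
    run_problem(toy["prompt"], solution, toy["test"])
    for solution in toy["canonical_solution"]
]
print(results)
```

The default `call="check()"` matches the TorchDataEval tests shown in this issue, where `check` takes no arguments. For a HumanEval-style record the call would instead pass the entry point, for example `call="check(has_close_elements)"`.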

lifelongeeek commented 9 months ago

@NL2Code I concatenated the prompt with one of the canonical_solution entries for each TorchDataEval problem. 82% of the cases passed without error, but the remaining 18% failed because of undefined names. The errors are listed below:

["failed: name 'Cycler' is not defined"]
["failed: name 'Enumerator' is not defined"]
["failed: name 'Demultiplexer' is not defined"]
["failed: name 'Forker' is not defined"]
["failed: name 'IterKeyZipper' is not defined"]
["failed: name 'MapKeyZipper' is not defined"]
["failed: name 'UnZipper' is not defined"]
["failed: name 'collated_ds' is not defined"]
["failed: name 'Grouper' is not defined"]
["failed: name 'FlatMapper' is not defined"]
["failed: name 'Mapper' is not defined"]
["failed: name 'Filter' is not defined"]
["failed: name 'Header' is not defined"]
["failed: name 'Rows2Columnar' is not defined"]
["failed: name 'Concater' is not defined"]
["failed: name 'Collator' is not defined"]

Do I need to import an external codebase to use these objects?
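To see which names are missing and how often, the result strings can be tallied with a short script. This is only a sketch: `missing_names` is a hypothetical helper, and it assumes the results follow the "failed: name '...' is not defined" format quoted above.

```python
import re
from collections import Counter

# Matches the message Python produces for a NameError.
NAME_ERROR = re.compile(r"name '([^']+)' is not defined")


def missing_names(results):
    """Count how many result strings report each undefined name."""
    counts = Counter()
    for result in results:
        match = NAME_ERROR.search(result)
        if match:
            counts[match.group(1)] += 1
    return counts


sample_results = [
    "failed: name 'Cycler' is not defined",
    "passed",
    "failed: name 'Cycler' is not defined",
    "failed: name 'Mapper' is not defined",
    "failed: 'NoneType' object is not callable",
]
print(missing_names(sample_results))
```

In the two TorchDataEval problems quoted earlier, the prompt imports only `IterableWrapper`, while one of the canonical solutions calls `Cycler` or `Enumerator` directly, so a tally like this may help show whether the failures line up with that form of the solution.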

NL2Code commented 9 months ago

No additional imports are required. I suspect the issue might be with the version of torchdata you have installed. I recommend trying out different versions.
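Before trying other versions, it can help to record which one is currently installed. The sketch below uses the standard-library `importlib.metadata`; `installed_version` is a hypothetical helper name, and this thread does not state which torchdata version the benchmark expects.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the packages relevant to TorchDataEval (None means not installed).
for name in ("torchdata", "torch"):
    print(name, installed_version(name))
```

Comparing this output between an environment where the canonical solutions pass and one where they fail is one way to narrow down a version mismatch.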