salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License
2.74k stars 401 forks

CodeT5+ 770m HumanEval pass@k cannot reproduce #138

Closed VichyTong closed 1 year ago

VichyTong commented 1 year ago

Hi there, I just cloned this repo and modified run_generate.sh as follows to reproduce the CodeT5+ 770M pass@k results on HumanEval:

model=codet5p-770m
temp=0.2
max_len=800
pred_num=200
num_seqs_per_iter=25 # 25 for 350M and 770M, 10 for 2B, 8 for 6B, 2 for 16B on A100-40G

output_path=preds/${model}_T${temp}_N${pred_num}

mkdir -p ${output_path}
echo 'Output path: '$output_path
echo 'Model to eval: '$model

# 164 problems, split into chunks of 84 per GPU (gpu_num=2)
index=0
gpu_num=2
for ((i = 0; i < $gpu_num; i++)); do
  start_index=$((i * 84))
  end_index=$(((i + 1) * 84))

  gpu=$((i))
  echo 'Running process #' ${i} 'from' $start_index 'to' $end_index 'on GPU' ${gpu}
  ((index++))
  (
    CUDA_VISIBLE_DEVICES=$gpu python generate_codet5p.py --model Salesforce/${model} \
      --start_index ${start_index} --end_index ${end_index} --temperature ${temp} \
      --num_seqs_per_iter ${num_seqs_per_iter} --N ${pred_num} --max_len ${max_len} --output_path ${output_path}
  ) &
  if (($index % $gpu_num == 0)); then wait; fi
done

After generating the code, I ran run_eval.sh and got this result:

(code) root@906471c230a3:~/origin/CodeT5+/humaneval# bash run_eval.sh 
Output path: preds/codet5p-770m_T0.2_N200/
164 files in preds/codet5p-770m_T0.2_N200/
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 164/164 [00:00<00:00, 356.38it/s]
save to preds/codet5p-770m_T0.2_N200/.jsonl
Reading samples...
32800it [00:01, 29740.91it/s]
Running test suites...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32800/32800 [14:15<00:00, 38.33it/s]
Writing results to preds/codet5p-770m_T0.2_N200/.jsonl_results.jsonl...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32800/32800 [00:00<00:00, 32860.55it/s]
{'pass@1': 0.018750000000000003, 'pass@10': 0.03140268086950306, 'pass@100': 0.04245913332052898}
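
For reference, the pass@k numbers above are presumably the standard unbiased estimator from the HumanEval/Codex evaluation: with n generated samples per problem, of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples is correct. A minimal sketch of that estimator (assuming this is what run_eval.sh computes):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased pass@k: 1 - C(n - c, k) / C(n, k),
    # where n = samples per problem and c = samples that pass the tests.
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: with n=200 samples and only c=4 passing, pass@100 is already ~0.94,
# so the scores above imply that almost no completions pass at all.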

That's quite a big difference from the paper, which reports 15.5, 27.2, and 42.7 for CodeT5+ 770M at pass@1, pass@10, and pass@100, respectively. I tried to find out the reason, and in the generated code snippets I found a huge amount of garbage code, for example in 0.jsonl:

0.jsonl (partial)


{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "package com.github.droidkaigi.android.glide.load.resource.drawable;\n\nimport android.graphics.drawable.Drawable;\nimport android.os.Build;\nimport android.support.anno>
{"task_id": "HumanEval/0", "completion": "<?php\n\nnamespace AppBundle\\Entity;\n\nuse Doctrine\\ORM\\Mapping as ORM;\n\n/**\n * @ORM\\Entity\n * @ORM\\Table(name=\"user\")\n */", "all_code": "from typing impor>
{"task_id": "HumanEval/0", "completion": "#include \"stdafx.h\"", "all_code": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "\ufeffusing System;\nusing System.Collections.Generic;\nusing System.Linq;\nusing System.Text;\nusing System.Threading.Tasks;\n\nnamespace _03.Sum_of_N_Numbers\n{\n    >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2020, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "// Copyright (c) Microsoft Corporation. All rights reserved.\n// Licensed under the MIT License.\n\n// <auto-generated/>\n", "all_code": "from typing import List\n\n\nd>
{"task_id": "HumanEval/0", "completion": "package com.github.dannil.scbjavaclient.model;\n\nimport com.github.dannil.scbjavaclient.util.JsonUtil;\nimport com.google.gson.annotations.SerializedName;\n\nimport ja>
{"task_id": "HumanEval/0", "completion": "<?php\n\nnamespace App\\Http\\Controllers;\n\nuse App\\Http\\Requests;\nuse App\\Http\\Controllers\\Controller;\nuse App\\User;\nuse Illuminate\\Http\\Request;\nuse Ill>
{"task_id": "HumanEval/0", "completion": "\ufeffusing System;\nusing System.Collections.Generic;\nusing System.Linq;\nusing System.Text;\nusing System.Threading.Tasks;\n\nnamespace _01.Sum_of_N_Numbers\n{\n    >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2020, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2020, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "\ufeffusing System;\nusing System.Collections.Generic;\nusing System.Linq;\nusing System.Text;\nusing System.Threading.Tasks;\n\nnamespace _04.Find_Largest_Number\n{\n >
{"task_id": "HumanEval/0", "completion": "\ufeffusing System.Reflection;\nusing System.Runtime.CompilerServices;\nusing System.Runtime.InteropServices;\n\n// General Information about an assembly is controlled >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "<?php\n\nnamespace App\\Http\\Controllers\\Admin;\n\nuse App\\Http\\Controllers\\Controller;\nuse App\\Http\\Requests;\nuse App\\Http\\Requests\\Admin\\CreateUserReques>
{"task_id": "HumanEval/0", "completion": "#include <stdio.h>", "all_code": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in given lis>
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "#!/usr/bin/env python3\n", "all_code": "from typing import List\n\n\ndef has_close_elements(numbers: List[float], threshold: float) -> bool:\n    \"\"\" Check if in giv>
{"task_id": "HumanEval/0", "completion": "/*\n * Copyright (c) 2008-2021, Hazelcast, Inc. All Rights Reserved.\n *\n * Licensed under the Apache License, Version 2.0 (the \"License\");\n * you may not use this >
{"task_id": "HumanEval/0", "completion": "<?php\n\nnamespace App\\Http\\Controllers\\Admin;\n\nuse App\\Http\\Controllers\\Controller;\nuse Illuminate\\Http\\Request;\nuse Illuminate\\Support\\Facades\\Auth;\nu>
{"task_id": "HumanEval/0", "completion": "<?php\n\nnamespace AppBundle\\Entity;\n\nuse Doctrine\\ORM\\Mapping as ORM;\n\n/**\n * @ORM\\Entity\n * @ORM\\Table(name=\"user\")\n */", "all_code": "from typing impor>
{"task_id": "HumanEval/0", "completion": "// Copyright (c) Microsoft Corporation. All rights reserved.\n// Licensed under the MIT License.\n\n// <auto-generated/>\n", "all_code": "from typing import List\n\n\nd>


So I wonder whether my hyperparameters or prompt do not work well with the CodeT5+ 770M model, or whether my reproduction method is wrong, e.g. using the wrong script or forgetting to set some parameter?

yuewang-cuhk commented 1 year ago

Hi there, for CodeT5+ 220M/770M results on HumanEval, please use the checkpoints that are further pretrained on the Python subset: codet5p-220m-py and codet5p-770m-py.
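
In the script above, this corresponds to setting model=codet5p-770m-py (the script prefixes it with Salesforce/ when calling generate_codet5p.py). For a quick sanity check outside the script, the checkpoint can also be loaded directly with the Hugging Face transformers API; a minimal sketch, with illustrative generation settings rather than the exact ones used for the paper:

from transformers import AutoTokenizer, T5ForConditionalGeneration

checkpoint = "Salesforce/codet5p-770m-py"  # checkpoint further pretrained on the Python subset
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# HumanEval-style prompt: function signature plus docstring.
prompt = 'def has_close_elements(numbers, threshold):\n    """Check if any two numbers are closer than threshold."""\n'
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_length=256, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))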

VichyTong commented 1 year ago

Thank you very much. Solved my problem.