Closed shivamag125 closed 5 months ago
I used the instruct model and only got {'pass@1': 0.05227492739593417}
:(
But when I use the raw <PRE>, <SUF>, <MID> tokens instead, in my tests it works better than <FILL_ME>.
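A hypothetical sketch of how the raw infilling tokens could be assembled into a prompt string. The exact whitespace handling is tokenizer-dependent and the format string here is an assumption for illustration, not the official Code Llama specification:

```python
# Hypothetical sketch: assembling an infilling prompt from the raw control
# tokens in prefix-suffix-middle (PSM) order. The model is then asked to
# generate the "middle" after the <MID> marker. Spacing is an assumption.

def build_psm_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

prompt = build_psm_prompt("def add(a, b):\n    return ", "\n")
print(prompt)
```

In practice the real control tokens are special tokenizer IDs, so building the prompt at the token level (rather than via string concatenation) is safer.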
Same issue here. I cannot reproduce the infilling results reported in the paper; my numbers are a bit lower. Any ideas?
Dear @shivamag125 , @timxx and @stgzr, thanks for reporting!
@timxx : The instruction models are not intended to be used for infilling, please use the pretrained models.
@shivamag125 and @stgzr : The hyperparameters (greedy decoding, i.e. temperature=0) are reported in the paper (Table 14). Note that you need to compare to the models with LCFT in the table, since pretrained models without LCFT have not been released. Moreover, a frequent problem for infilling models is knowing where to stop. Our code cuts the generation after the first linebreak in the single-line infilling task.
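The single-line stopping rule described above can be sketched in a few lines (a minimal illustration, not the authors' actual evaluation code):

```python
# Minimal sketch of the single-line stopping heuristic: keep only the text
# up to (but not including) the first linebreak in the generated completion.

def truncate_single_line(generation: str) -> str:
    return generation.split("\n", 1)[0]

print(truncate_single_line("return a + b\nprint('extra')"))
```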
Thank you for the detailed reply. I will check my implementation. Another question: when should generation stop in the multi-line and random-span tasks, using \<EOT>?
Thanks! Using a stopping condition like \n reproduces the numbers.
For multi-line, there are other stopping heuristics (see TruncationParameters in https://github.com/Eric-Wallace/codex/blob/main/infill_evaluation.py), but IIRC both https://github.com/bigcode-project/bigcode-evaluation-harness and our internal code use only EOT as the stop symbol for multi-line.
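An EOT-based stopping rule for the multi-line and random-span tasks can be sketched as below. The literal marker string "<EOT>" is an assumption for illustration; in the real pipeline EOT is a special token ID checked during decoding:

```python
# Sketch of an EOT stopping rule: discard everything from the first
# occurrence of the end-of-infilling marker onward. The marker string
# "<EOT>" is a placeholder for the model's actual special token.

def truncate_at_eot(generation: str, eot: str = "<EOT>") -> str:
    idx = generation.find(eot)
    return generation if idx == -1 else generation[:idx]
```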
Hello,
I am trying to reproduce the infilling results on HumanEval (Table 14, CodeLLAMA 7B SPM, pass@1=83%). I am using the single-line benchmark from https://github.com/openai/human-eval-infilling. I use the code below to generate the samples.
Next I run the following to compute pass@1. I obtain pass@1 = 0.73281, which is much lower than the reported result.
evaluate_infilling_functional_correctness samples_base_pretrained_codellama.jsonl --benchmark_name=single-line
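For reference, pass@k is usually computed with the unbiased estimator from the original HumanEval paper; with greedy decoding (a single sample per task, n=1) pass@1 reduces to the fraction of tasks solved. A sketch:

```python
# Unbiased pass@k estimator (HumanEval paper): probability that at least
# one of k samples drawn from n total is correct, given c correct samples.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With greedy decoding, pass_at_k(1, c, 1) is simply 1.0 if the single sample passes and 0.0 otherwise, averaged over tasks.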
Can you please help with the following: