salesforce / CodeGen

CodeGen is a family of open-source model for program synthesis. Trained on TPU-v4. Competitive with OpenAI Codex.
Apache License 2.0
4.94k stars 381 forks source link

Why tokenizer.pad_token == args.pad (i.e., 50256)?? #57

Open 9yte opened 1 year ago

9yte commented 1 year ago

Hi,

For my project, I'm trying to fine-tune CodeGen models on my dataset and evaluate the resulting fine-tuned model on the HumanEval benchmark dataset. I have a few questions that I would appreciate if you could address.

  1. First, why in the sampling code, at line 234, we have tokenizer.pad_token == args.pad, which is 50256. Shouldn't we set the pad_token to eos_token, not 50256 (which is the eos_token_id)? I'm confused by this. At line 240, you set the parameter pad_token_id=args.pad. So in your sampling code, both pad_token and pad_token_id are set to 50256. Can you please elaborate on this? That would be super helpful.

  2. As a baseline, I need to replicate your single-turn HumanEval benchmark results, but unfortunately, I'm getting surprisingly lower results compared to what is reported in the paper. And, I'm 99% positive that I'm probably missing a point. To produce Table 1 results in the paper, did you use the exact same sampling procedure as sample.py?

Thanks a lot for your time.