
CodeParrot: Some Questions to Make it Better #242

Open · Symbolk opened this issue 2 years ago

Symbolk commented 2 years ago

I have tried the online CodeParrot code-generation demo in HF Spaces (https://huggingface.co/spaces/lvwerra/codeparrot-generation), and have 3 questions; I hope someone can help with answers!

  1. About the evaluation: according to the blog (https://huggingface.co/blog/codeparrot?utm_source=pocket_mylist), the model falls far behind Codex on HumanEval, and pass@k only increases a little as the parameter count grows (a sketch of the pass@k estimator is included after this list). What are the major reasons for that, and what can we do to improve it?

  2. About the practical performance: when generating code for simple requirements beyond the given examples (e.g., get the median), it does not perform well. One obvious issue is that it generates incorrect code; another is that it only stops when the maximum token number is reached, instead of stopping dynamically before the next function or class via a so-called stop sequence. My question is: how can we keep it from generating endlessly, in a similar way?

  3. About the data processing: recent papers and new versions of Copilot have shown that multilingual code datasets do help the model write better Python code. Since the dataset of CodeParrot comes from only 2.8M repos (compared to the 54M repos of Codex), could the data size be a reason for the poor performance? (BTW, why were files larger than 1MB discarded? I also wondered about this when reading the Codex paper.)
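
For reference, here is a minimal sketch of the unbiased pass@k estimator from the Codex paper (the same metric reported in the blog); the sample counts in the usage line are made up purely for illustration:

```python
# Unbiased pass@k estimator from the Codex paper:
#   pass@k = E[ 1 - C(n - c, k) / C(n, k) ]
# where n = samples generated per problem and c = samples that pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 13 pass the tests, estimate pass@10.
print(pass_at_k(n=200, c=13, k=10))
```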

Any help is appreciated! @lvwerra

lvwerra commented 2 years ago

Hi @Symbolk

Regarding questions 1 & 3: I think there are two main reasons why the model performs worse than Codex:

Regarding the stopping criteria: you can have a look at the `generate` function in the transformers library; it accepts a wide range of strategies to improve code quality, and you can pass custom stopping criteria to stop on a stop sequence (see the sketch below).
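
To make that concrete, here is a minimal sketch (not from the thread) of a custom stopping criterion that halts generation once a stop sequence such as a new top-level `def` or `class` appears; the checkpoint name, stop strings, and sampling parameters are assumptions chosen for illustration:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnSequences(StoppingCriteria):
    """Stop as soon as the newly generated text contains one of the stop strings."""

    def __init__(self, stop_strings, tokenizer, prompt_length):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        # Only inspect the freshly generated part, not the prompt itself.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return any(s in new_text for s in self.stop_strings)

# Assumed checkpoint; the Space in the question is built on the codeparrot models.
tokenizer = AutoTokenizer.from_pretrained("lvwerra/codeparrot")
model = AutoModelForCausalLM.from_pretrained("lvwerra/codeparrot")

prompt = "def get_median(numbers):\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Typical stop sequences for function-level completion.
stops = StoppingCriteriaList(
    [StopOnSequences(["\ndef ", "\nclass ", "\nif __name__"], tokenizer, inputs.input_ids.shape[1])]
)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    stopping_criteria=stops,
)
# The stop sequence itself is still part of the output, so trim it afterwards if needed.
print(tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:]))
```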

For the performance issues you can also do a number of things:

A final note: the main goal of CodeParrot was not to replace Codex or similar models, but to show that the tools to train such models are available, and to enable somebody with more compute and the need for such a model to train one with these tools.

Hope this clarifies your questions!

Symbolk commented 2 years ago

Hi @lvwerra, many thanks for your response! I have recently done some survey of this trending domain, so I think I fully understand your points~

Another question is: do you think NL pretraining is necessary or beneficial for the PL generation task? Although the Codex paper mentioned this:

> Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments.

Since no model can compete with or reproduce it, and all fall far behind it according to this paper: https://arxiv.org/abs/2202.13169, could it be possible that NL understanding capability plays a vital role in the success of Codex?

This is an open question, but I am happy to hear your opinion and best guess!

lvwerra commented 2 years ago

You might find the insights in the AlphaCode paper interesting. They trained a decoder-only model from scratch on Python only and managed to match Codex's performance:

[Screenshot: performance comparison table from the AlphaCode paper]

They also did an ablation study on the pretraining dataset (although the performance is measured after fine-tuning on more code data):

[Screenshot: pretraining-dataset ablation table from the AlphaCode paper]

PS: To answer your question about why only files <1MB were kept - files larger than that are usually automatically generated and don't contain human-written code. E.g., the BigQuery dataset excludes them by default.
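
As a tiny illustration of that filtering step (the dataset name and `content` column here are assumptions, not necessarily what the actual preprocessing scripts use), dropping files above roughly 1MB with `datasets` could look like:

```python
from datasets import load_dataset

# Stream a dataset of source files and keep only those below ~1MB of text,
# since larger files tend to be auto-generated rather than human-written code.
ds = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
ds_small = ds.filter(lambda example: len(example["content"].encode("utf-8")) < 1_000_000)
```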

Symbolk commented 2 years ago

Yes, I also noticed these figures earlier today, but you have certainly reminded me to think about them more. If we put the insights from Codex and AlphaCode together, I think we can conclude that multilingual PL data is more useful than NL data, or at least that it is more likely to bring improvements, as also shown by several other works in the literature (e.g., PolyCoder).

Although pretraining on MassiveText brings an improvement over Python-only training, maybe the credit should go to the multilingual GitHub portion within it?