
CodeParrot: Some Questions to Make it Better #242

Open · Symbolk opened this issue 2 years ago

Symbolk commented 2 years ago

I have tried the online CodeParrot code-generation demo in HF Spaces (https://huggingface.co/spaces/lvwerra/codeparrot-generation), and have 3 questions; I hope someone can help with answers!

  1. About the evaluation: according to the blog (https://huggingface.co/blog/codeparrot?utm_source=pocket_mylist), the model falls far behind Codex on HumanEval, and pass@k only increases a little as the parameter count grows (a sketch of the pass@k estimator is included after this list). What are the major reasons for that, and what can we do to improve it?

  2. About the practical performance: when generating code for simple requirements beyond the given examples (e.g., get the median), it does not perform well. One obvious issue is that it generates incorrect code; another is that it only stops when the maximum token number is reached, instead of stopping dynamically before the next function or class via a so-called stop sequence. My question is: how can we keep it from generating endlessly, in a similar way?

  3. About the data processing: recent papers and new versions of Copilot have shown that multilingual code datasets do help the model write better Python code. Since the dataset of CodeParrot comes from only 2.8M repos (compared to the 54M repos of Codex), could the data size be a reason for the poor performance? (BTW, why were files larger than 1MB discarded? I also wondered about this when reading the Codex paper.)
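
For reference, here is a minimal sketch of the unbiased pass@k estimator from the Codex paper (the same metric reported in the blog); the sample counts in the usage line are made up purely for illustration:

```python
# Unbiased pass@k estimator from the Codex paper:
#   pass@k = E[ 1 - C(n - c, k) / C(n, k) ]
# where n = samples generated per problem and c = samples that pass the unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) is correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical example: 200 samples per problem, 13 pass the tests, estimate pass@10.
print(pass_at_k(n=200, c=13, k=10))
```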

Any help is appreciated! @lvwerra

lvwerra commented 2 years ago

Hi @Symbolk

Regarding questions 1 & 3: I think there are two main reasons why the model performs worse than Codex:

Regarding the stopping criteria: you can have a look at the `generate` function in the transformers library; it accepts a wide range of strategies to improve code quality, and you can pass custom stopping criteria to stop on a stop sequence (see the sketch below).
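
To make that concrete, here is a minimal sketch (not from the thread) of a custom stopping criterion that halts generation once a stop sequence such as a new top-level `def` or `class` appears; the checkpoint name, stop strings, and sampling parameters are assumptions chosen for illustration:

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    StoppingCriteria,
    StoppingCriteriaList,
)

class StopOnSequences(StoppingCriteria):
    """Stop as soon as the newly generated text contains one of the stop strings."""

    def __init__(self, stop_strings, tokenizer, prompt_length):
        self.stop_strings = stop_strings
        self.tokenizer = tokenizer
        self.prompt_length = prompt_length  # number of prompt tokens to skip

    def __call__(self, input_ids, scores, **kwargs):
        # Only inspect the freshly generated part, not the prompt itself.
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_length:])
        return any(s in new_text for s in self.stop_strings)

# Assumed checkpoint; the Space in the question is built on the codeparrot models.
tokenizer = AutoTokenizer.from_pretrained("lvwerra/codeparrot")
model = AutoModelForCausalLM.from_pretrained("lvwerra/codeparrot")

prompt = "def get_median(numbers):\n"
inputs = tokenizer(prompt, return_tensors="pt")

# Typical stop sequences for function-level completion.
stops = StoppingCriteriaList(
    [StopOnSequences(["\ndef ", "\nclass ", "\nif __name__"], tokenizer, inputs.input_ids.shape[1])]
)

outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    stopping_criteria=stops,
)
# The stop sequence itself is still part of the output, so trim it afterwards if needed.
print(tokenizer.decode(outputs[0, inputs.input_ids.shape[1]:]))
```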

For the performance issues you can also do a number of things:

A final note: the main goal of CodeParrot was not to replace Codex or similar models, but to show that the tools to train such models are available, and to enable somebody with more compute and the need for such a model to train one with these tools.

Hope this clarifies your questions!

Symbolk commented 2 years ago

Hi @lvwerra, many thanks for your response! I have recently done some survey of this trending domain, so I think I fully understand your points~

Another question is: do you think NL pretraining is necessary or beneficial for the PL generation task? Although the Codex paper mentioned this:

> Surprisingly, we did not observe improvements when starting from a pre-trained language model, possibly because the finetuning dataset is so large. Nevertheless, models fine-tuned from GPT converge more quickly, so we apply this strategy for all subsequent experiments.

Since no model can compete with or reproduce it, and all fall far behind it according to this paper: https://arxiv.org/abs/2202.13169, could it be possible that NL understanding capability plays a vital role in the success of Codex?

This is an open question, but I am happy to hear your opinion and best guess!

lvwerra commented 2 years ago

You might find the insights in the AlphaCode paper interesting. They trained a decoder-only model from scratch on Python only and managed to match Codex's performance:

[Screenshot: performance comparison table from the AlphaCode paper]

They also did an ablation study on the pretraining dataset (although the performance is measured after fine-tuning on more code data):

[Screenshot: pretraining-dataset ablation table from the AlphaCode paper]

PS: To answer your question about why only files <1MB were kept - files larger than that are usually automatically generated and don't contain human-written code. E.g., the BigQuery dataset excludes them by default.
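
As a tiny illustration of that filtering step (the dataset name and `content` column here are assumptions, not necessarily what the actual preprocessing scripts use), dropping files above roughly 1MB with `datasets` could look like:

```python
from datasets import load_dataset

# Stream a dataset of source files and keep only those below ~1MB of text,
# since larger files tend to be auto-generated rather than human-written code.
ds = load_dataset("codeparrot/codeparrot-clean", split="train", streaming=True)
ds_small = ds.filter(lambda example: len(example["content"].encode("utf-8")) < 1_000_000)
```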

Symbolk commented 2 years ago

Yes, I also noticed these figures earlier today, but you have certainly reminded me to think about them more. If we put the insights from Codex and AlphaCode together, I think we can conclude that multilingual PL data is more useful than NL data, or at least that it is more likely to bring improvements, as also shown by several other works in the literature (e.g., PolyCoder).

Although pretraining on MassiveText brings an improvement over Python-only training, maybe the credit should go to the multilingual GitHub portion within it?