salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License
2.81k stars 421 forks

How to do Retrieval-Augmented Generation? #135

Open luweigen opened 1 year ago

luweigen commented 1 year ago

Any hints for reproducing the example in Figure 7 of the paper "CodeT5+: Open Code Large Language Models for Code Understanding and Generation"? Thanks in advance!

yuewang-cuhk commented 1 year ago

Hi Wei,

Let me share some guidance here. For retrieval-augmented code generation, we follow the settings introduced in this paper (Retrieval Augmented Code Generation and Summarization) to evaluate our models. We adopt a straightforward approach: we use CodeT5+'s encoder to retrieve the top-1 code candidate, concatenate it with the source text as the input to the model's encoder, and train the decoder to generate the target code. You can use this embedding model for the retrieval part. Before the evaluation, you need to prepare a finetuning dataset of pairs whose input is "text + retrieved top-1 code" and whose output is the "target code".
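For anyone trying this, here is a minimal sketch of the data-preparation step described above. It is not the authors' script. Everything in it is an assumption made for illustration: `toy_embed` is a bag-of-tokens stand-in for the real CodeT5+ embedding model, the `" </s> "` separator is a guess at how to join text and retrieved code, and skipping each example's own target during retrieval is a precaution against leakage that the comment above does not mention.

```python
import math
import re
from collections import Counter
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = Dict[str, float]
Embedder = Callable[[str], Vector]


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in embedder: L2-normalized bag of alphanumeric tokens.

    Swap this for a real encoder (e.g. the CodeT5+ embedding model).
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def retrieve_top1(query: str, corpus: Sequence[str], embed: Embedder,
                  exclude: Optional[int] = None) -> int:
    """Return the index of the corpus entry most similar to the query."""
    query_vec = embed(query)
    best_idx, best_score = -1, float("-inf")
    for idx, code in enumerate(corpus):
        if idx == exclude:  # assumption: never retrieve the example's own target
            continue
        score = cosine(query_vec, embed(code))
        if score > best_score:
            best_idx, best_score = idx, score
    if best_idx < 0:
        raise ValueError("no retrieval candidates available")
    return best_idx


def build_pairs(examples: Sequence[Tuple[str, str]],
                embed: Embedder = toy_embed,
                sep: str = " </s> ") -> List[Tuple[str, str]]:
    """Build ("text + retrieved top-1 code", "target code") finetuning pairs."""
    corpus = [code for _, code in examples]
    pairs = []
    for idx, (text, target) in enumerate(examples):
        hit = retrieve_top1(text, corpus, embed, exclude=idx)
        pairs.append((text + sep + corpus[hit], target))
    return pairs


# Tiny made-up dataset of (text, target code) examples.
EXAMPLES = [
    ("reverse a string", "def reverse_string(s): return s[::-1]"),
    ("reverse a list", "def reverse_list(xs): return xs[::-1]"),
    ("read json", "def read_json(path): return json.load(open(path))"),
]

if __name__ == "__main__":
    for source, target in build_pairs(EXAMPLES):
        print(repr(source), "->", repr(target))
```

The resulting pairs can then be fed to an ordinary seq2seq finetuning loop. At evaluation time the same retrieval step would run against the retrieval corpus before each input is passed to the encoder.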