salesforce / CodeT5

Home of CodeT5: Open Code LLMs for Code Understanding and Generation
https://arxiv.org/abs/2305.07922
BSD 3-Clause "New" or "Revised" License

Code summarization example #126

Closed by JaktensTid 1 year ago

JaktensTid commented 1 year ago

Hi, is there any example of a code summarization? Thank you in advance!

yuewang-cuhk commented 1 year ago

Yes, please refer to the example below. Salesforce/codet5-base-multi-sum is a CodeT5-base model that is jointly trained on 6 code summarization tasks using CodeSearchNet data.

from transformers import RobertaTokenizer, T5ForConditionalGeneration

if __name__ == '__main__':
    tokenizer = RobertaTokenizer.from_pretrained('Salesforce/codet5-base')
    model = T5ForConditionalGeneration.from_pretrained('Salesforce/codet5-base-multi-sum')

    text = """def svg_to_image(string, size=None):
    if isinstance(string, unicode):
        string = string.encode('utf-8')
        renderer = QtSvg.QSvgRenderer(QtCore.QByteArray(string))
    if not renderer.isValid():
        raise ValueError('Invalid SVG data.')
    if size is None:
        size = renderer.defaultSize()
        image = QtGui.QImage(size, QtGui.QImage.Format_ARGB32)
        painter = QtGui.QPainter(image)
        renderer.render(painter)
    return image"""

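    # Tokenize the source code into input IDs (PyTorch tensor).
    input_ids = tokenizer(text, return_tensors="pt").input_ids

    # Generate a summary of at most 20 tokens, then decode it to text.
    generated_ids = model.generate(input_ids, max_length=20)
    print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))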
    # this prints: "Convert a SVG string to a QImage."

JaktensTid commented 1 year ago

Hi @yuewang-cuhk, thank you for your help! I have another question: is it possible to pass an entire class or a group of classes instead of a single function, so the model can pick up the context more precisely, or must I fine-tune it for this task?

yuewang-cuhk commented 1 year ago

The model should also be able to summarize a larger piece of code to some extent. But because it was trained on code-text pairs at the function level, fine-tuning it on your new use case would definitely work better.
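
Since the model was trained on function-level pairs, one possible workaround short of fine-tuning is to split a class or module into individual functions and summarize each one separately. The sketch below is not from the CodeT5 authors; it only shows the splitting step for Python source, using the standard library `ast` module. The helper name `split_into_functions` and the `SAMPLE` input are hypothetical, and whether per-function summaries are good enough for your use case is untested here.

```python
import ast
import textwrap


def split_into_functions(source):
    """Return (qualified_name, source_text) pairs for each top-level
    function and each method of a (possibly nested) class in `source`.

    Hypothetical helper for illustration; requires Python 3.8+.
    """
    tree = ast.parse(source)
    chunks = []

    def visit(body, prefix):
        for node in body:
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                # padded=True keeps the first line's indentation so that
                # dedent can strip the common leading whitespace cleanly.
                segment = ast.get_source_segment(source, node, padded=True)
                chunks.append((prefix + node.name, textwrap.dedent(segment)))
            elif isinstance(node, ast.ClassDef):
                # Descend into the class body to collect its methods.
                visit(node.body, prefix + node.name + ".")

    visit(tree.body, "")
    return chunks


# Made-up input used only to demonstrate the helper.
SAMPLE = '''
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


def describe(shape):
    return f"area={shape.area()}"
'''

for name, code in split_into_functions(SAMPLE):
    # Each `code` string could be passed to the tokenizer and
    # model.generate(...) exactly like `text` in the example above.
    print(name)
```

Each returned snippet is a standalone function body, which matches the granularity the model saw during training. Class-level context such as attributes and docstrings is lost this way, so fine-tuning remains the better option when that context matters.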