microsoft / CodeBERT

MIT License

Correct formatting of data passed to CodeReviewer #241

Open p4vv37 opened 1 year ago

p4vv37 commented 1 year ago

Hi, could you provide example code showing how to prepare data for the CodeReviewer model? Is the marker for "old_file"? Should the text passed to the tokenizer look something like this: <s>source file content</s><add>new line<add>another one, etc.?

I implemented a code review of GitHub commits this way, and I'm wondering whether this is the correct approach.

celbree commented 1 year ago

The data preparation code is in https://github.com/microsoft/CodeBERT/blob/master/CodeReviewer/code/utils.py. Refer to the Dataset class there that matches your task.
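
For readers who want a starting point before reading utils.py, here is a minimal sketch of turning a unified diff hunk into a tagged string of the kind discussed above. It is not code from the repository. Only the `<add>` tag appears in this thread; the `<del>` and `<keep>` tags, the function name `diff_hunk_to_tagged_text`, and the choice to strip indentation and drop empty lines are assumptions that should be checked against the Dataset classes in utils.py.

```python
# Hypothetical helper, not taken from the CodeBERT repository.
# Only "<add>" is mentioned in this thread; "<del>" and "<keep>" are assumed
# tag names. Confirm all of them against CodeReviewer/code/utils.py.
TAGS = {"+": "<add>", "-": "<del>", " ": "<keep>"}


def diff_hunk_to_tagged_text(hunk: str) -> str:
    """Convert one unified diff hunk into a single tagged string."""
    lines = hunk.split("\n")
    # Drop the "@@ -a,b +c,d @@" hunk header if it is present.
    if lines and lines[0].startswith("@@"):
        lines = lines[1:]

    parts = []
    for line in lines:
        if not line.strip():
            continue  # skip empty or whitespace-only lines (an assumption)
        marker = line[0]
        # Lines without a recognised marker are treated as unchanged context.
        tag = TAGS.get(marker, "<keep>")
        body = line[1:] if marker in TAGS else line
        parts.append(tag + body.strip())
    return "".join(parts)


if __name__ == "__main__":
    example = "@@ -1,2 +1,2 @@\n def f():\n-    return 1\n+    return 2"
    print(diff_hunk_to_tagged_text(example))
```

The resulting string would then go to the model's tokenizer, which normally adds the `<s>` and `</s>` boundary tokens itself. Whether the tags are registered as special tokens in the tokenizer vocabulary is something to verify in the repository rather than assume.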