microsoft / CodeBERT

MIT License

Which paper is 'replaced token detection' inspired by? #82

Closed skye95git closed 2 years ago

skye95git commented 3 years ago

Hi, I watched a video in which Duyu Tang introduced CodeBERT. The 'replaced token detection' objective appears to have been inspired by a 2020 paper from Google and Stanford, but Duyu did not mention which paper it was. Can you share its title?

According to Duyu, 'replaced token detection' is meant to take advantage of code without comments and of comments without code. Why does using a discriminator to identify which tokens were replaced make it possible to use both types of data?

guoday commented 3 years ago

The paper is "ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators": https://arxiv.org/pdf/2003.10555.pdf
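
For readers following the thread, here is a minimal sketch of how replaced token detection labels can be built, assuming a toy setup rather than the actual CodeBERT or ELECTRA training code. The helper names (`corrupt`, `rtd_labels`, `toy_generator`) and the tiny vocabulary are hypothetical. The point it illustrates is that the label for each position only requires comparing a sequence with its corrupted copy, so a code-only or comment-only sequence can be used without a paired counterpart.

```python
import random

# Hypothetical toy vocabulary standing in for a real generator's output space.
TOY_VOCAB = ["return", "x", "y", "+", "-", "def", "max"]


def toy_generator(rng):
    """Return a sampler that proposes a replacement token for a position.

    A real generator would be a language model; this one samples uniformly
    from TOY_VOCAB and ignores the context (an assumption made for brevity).
    """
    def sample(position, tokens):
        return rng.choice(TOY_VOCAB)
    return sample


def corrupt(tokens, positions, sample_token):
    """Copy `tokens` and overwrite the chosen positions with generator samples."""
    corrupted = list(tokens)
    for i in positions:
        corrupted[i] = sample_token(i, tokens)
    return corrupted


def rtd_labels(original, corrupted):
    """Label each position 1 if the token was replaced, 0 if it is original.

    If the generator happens to sample the original token, the position is
    labeled 0, which matches the convention described in the ELECTRA paper.
    """
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original, corrupted)]


# A code-only sequence: no paired natural-language comment is needed.
code_tokens = ["def", "add", "(", "x", ",", "y", ")", ":", "return", "x", "+", "y"]
rng = random.Random(0)
corrupted = corrupt(code_tokens, [1, 10], toy_generator(rng))
labels = rtd_labels(code_tokens, corrupted)
print(list(zip(corrupted, labels)))
```

The discriminator would then be trained to predict `labels` from `corrupted`. The same routine applies unchanged to a comment-only sequence.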

skye95git commented 2 years ago

Thanks for your reply. I see that the tokenizer used for fine-tuning is --tokenizer_name=microsoft/codebert-base. Where does this tokenizer come from? Did you retrain it on the code domain?

guoday commented 2 years ago

The tokenizer comes from roberta-base. We don't re-train the tokenizer.
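
One way to check this locally is to compare the two vocabularies. The sketch below assumes the Hugging Face `transformers` package is installed and that both checkpoints can be downloaded. `vocab_diff` and `compare_codebert_and_roberta` are hypothetical helper names, not part of the CodeBERT repository.

```python
def vocab_diff(tok_a, tok_b):
    """Return the tokens whose presence or id differs between two tokenizers.

    Both arguments only need a `get_vocab()` method returning a
    token -> id mapping, as Hugging Face tokenizers provide.
    """
    vocab_a, vocab_b = tok_a.get_vocab(), tok_b.get_vocab()
    return {
        token
        for token in vocab_a.keys() | vocab_b.keys()
        if vocab_a.get(token) != vocab_b.get(token)
    }


def compare_codebert_and_roberta():
    """Download both tokenizers and return their vocabulary differences.

    Requires network access; the import is kept local so that defining
    this function does not itself require `transformers`.
    """
    from transformers import AutoTokenizer

    codebert = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    roberta = AutoTokenizer.from_pretrained("roberta-base")
    return vocab_diff(codebert, roberta)
```

Calling `compare_codebert_and_roberta()` should return an empty set if the two vocabularies are identical, which is what the reply above implies.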