microsoft / CodeXGLUE


When code tokens are split, can CodeBERT understand the original meaning? #96

Closed: yz-qiang closed this issue 2 years ago

yz-qiang commented 2 years ago

Hi, CodeBERT is nice work, and thank you for open-sourcing the code. I get confused when I use CodeBERT for downstream tasks. When a code token is split into different sub-tokens, can CodeBERT still understand the meaning of the original code token? For example, when I run tokenizer.tokenize("isFile()"), the token isFile() is split into is, File, and (). In this case, can CodeBERT capture the connection between these sub-tokens? Besides, if I need to predict the <mask> token in the code os.path.<mask>, can CodeBERT handle that? Please reply at your convenience, thank you very much. :)

guoday commented 2 years ago
  1. Please refer to this reply. The tokenizer will use Ġ as a special token to represent the beginning sub-token, so the model can still tell where each original token starts.

  2. If you want to predict the <mask>, you can use this pipeline. A short sketch of both points follows below.
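
For reference, a minimal sketch of both points, assuming the Hugging Face transformers library and the microsoft/codebert-base-mlm checkpoint (the MLM variant of CodeBERT); the example strings are just illustrations:

```python
# Minimal sketch, assuming Hugging Face transformers and the
# microsoft/codebert-base-mlm checkpoint (MLM variant of CodeBERT).
from transformers import RobertaTokenizer, RobertaForMaskedLM, pipeline

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base-mlm")
model = RobertaForMaskedLM.from_pretrained("microsoft/codebert-base-mlm")

# 1) Sub-token splitting: sub-tokens that start a new, whitespace-preceded
#    token are prefixed with "Ġ", while continuation sub-tokens are not,
#    so the model can still see where each original token begins.
print(tokenizer.tokenize("isFile()"))
print(tokenizer.tokenize("if f.isFile(): pass"))

# 2) Predicting a single <mask> with the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for prediction in fill_mask("os.path.<mask>"):
    print(prediction["token_str"], prediction["score"])
```

Note that the pipeline fills each <mask> with a single sub-token from the vocabulary.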

yz-qiang commented 2 years ago
> 1. Please refer to this reply. The tokenizer will use Ġ as a special token to represent the beginning sub-token.
> 2. If you want to predict the <mask>, you can use this pipeline.

Thanks for your reply. For the second point, I know I can use the pipeline to predict the masked position. But the <mask> in os.path.<mask> cannot be predicted as isFile(), because isFile() is not in the model vocabulary. Do you have any suggestions to fix this problem? Maybe you will tell me to add isFile() to the model vocabulary. However, I doubt that a randomly initialized embedding for the new token isFile() would be useful, and I also worry about the out-of-vocabulary problem. Can you give me some suggestions? Thank you. :)

guoday commented 2 years ago

Yes, CodeBERT can't predict a multi-token span like isFile(). Two suggestions: 1) use multiple <mask> tokens, e.g. os.path.<mask><mask><mask><mask>; 2) use another model such as CodeT5, which can predict a span. A rough sketch of the CodeT5 approach is below.
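
For the second suggestion, a rough sketch assuming the Hugging Face transformers library and the Salesforce/codet5-base checkpoint; the input string is only an illustrative example:

```python
# Rough sketch: span infilling with CodeT5, assuming Hugging Face transformers
# and the Salesforce/codet5-base checkpoint.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# CodeT5 marks the span to be filled with a sentinel token (<extra_id_0>);
# the decoder then generates the whole missing span, not a single sub-token.
text = "if os.path.<extra_id_0>(filename): print('found')"
input_ids = tokenizer(text, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```

With the first suggestion, each <mask> is still predicted as one sub-token, so you have to guess in advance how many sub-tokens the target span contains; a span-infilling model avoids that guess.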

yz-qiang commented 2 years ago

Thank you very much, I will try it. :)