microsoft / CodeBERT

Questions about the inputs for getting embeddings #204

Closed: rongqipan closed this issue 1 year ago

rongqipan commented 1 year ago

Hi,

Thanks for your work. I tried using CodeBERT, GraphCodeBERT, and UniXcoder to extract Java code embeddings. However, as input to the models I used only the Java source code, in the form [CLS][JavaCode][SEP].

  1. Should I also include comments in the inputs?
  2. For GraphCodeBERT and UniXcoder, should I also add the dataflow and the flattened AST as input? I care about the execution time of the approach, so would adding that information (comments, dataflow, and AST) make extracting the embeddings much slower?

I would appreciate your kind suggestions,

Thanks.

guoday commented 1 year ago

  1. It's better to add comments.
  2. You don't need to add dataflow or the flattened AST as input. The original code is enough. If you want to extract code embeddings, I suggest you use UniXcoder, which performed better in my tests on most datasets.
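
For readers following this advice, here is a minimal sketch of feeding raw code, optionally with a comment, to UniXcoder with no dataflow or AST. It is not the maintainers' own extraction script. The helper names `build_input_text` and `embed` are hypothetical, the checkpoint name `microsoft/unixcoder-base` and loading it through plain `transformers` are assumptions, and mean pooling over non-padding tokens is one common choice for a single vector per snippet. The repository ships its own UniXcoder wrapper, which may format inputs differently.

```python
from typing import List, Optional


def build_input_text(code: str, comment: Optional[str] = None) -> str:
    """Join an optional natural-language comment with the raw source code.

    Hypothetical helper: no dataflow or AST is added, only plain text.
    """
    code = code.strip()
    if comment:
        return f"{comment.strip()}\n{code}"
    return code


def embed(texts: List[str], model_name: str = "microsoft/unixcoder-base"):
    """Return one mean-pooled vector per input text.

    Requires torch and transformers; the checkpoint name is an assumption.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()

    # The tokenizer adds the special start/end tokens around each text.
    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)


if __name__ == "__main__":
    java = "int add(int a, int b) { return a + b; }"
    text = build_input_text(java, comment="// Returns the sum of two integers.")
    print(embed([text]).shape)
```

On the execution-time question: transformer inference cost grows with the number of input tokens, so comments add time roughly in line with their length, and truncation at `max_length` caps the cost per snippet.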

rongqipan commented 1 year ago

Thanks for your reply and kind suggestions : )