microsoft / CodeBERT

How to extract GraphCodeBERT embeddings? #112

Closed: kb-open closed this issue 2 years ago

kb-open commented 2 years ago

Thanks for the work done! I have two questions.

  1. How to get GraphCodeBERT embeddings for Python code?
  2. My downstream task is to classify Python functions in a file into different categories. Will it be possible to use GraphCodeBERT embeddings directly and feed them into a simple machine learning classifier (e.g., a random forest) to train it (i.e., without any fine-tuning of GraphCodeBERT)? Or would you suggest fine-tuning instead?
guoday commented 2 years ago
  1. Please refer to https://github.com/microsoft/CodeBERT/issues/94. Here's an example:
    >>> from transformers import AutoTokenizer, AutoModel
    >>> import torch
    >>> tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
    >>> model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
    >>> code_tokens=tokenizer.tokenize("def max(a,b): if a>b: return a else return b")
    >>> tokens=[tokenizer.cls_token]+code_tokens+[tokenizer.sep_token]
    >>> tokens_ids=tokenizer.convert_tokens_to_ids(tokens)
    >>> # [0] is the last hidden states; [0,0] selects the [CLS] token embedding
    >>> context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0][0,0]

2. Yes, it is possible. But I suspect that using GraphCodeBERT embeddings alone to train a classifier won't achieve the best performance, because we don't specifically pre-train sequence-level embeddings of code fragments in GraphCodeBERT. Maybe UniXcoder (to be released next week) is more suitable, since we design two pre-training tasks to learn code fragment representations. However, if you can, fine-tuning will be a good choice.
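
For illustration, a minimal sketch of that setup, assuming scikit-learn and placeholder functions/labels (the model and indexing follow the example above):

    import torch
    from transformers import AutoTokenizer, AutoModel
    from sklearn.ensemble import RandomForestClassifier

    tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
    model = AutoModel.from_pretrained("microsoft/graphcodebert-base")
    model.eval()

    def embed(code):
        # Tokenize (adds [CLS]/[SEP] automatically) and return the [CLS] embedding
        inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs[0][0, 0].numpy()  # 768-dim vector

    # Placeholder data: Python functions and hypothetical category labels
    functions = ["def add(a, b): return a + b",
                 "def read_file(path): return open(path).read()"]
    labels = ["math", "io"]

    X = [embed(f) for f in functions]
    clf = RandomForestClassifier(n_estimators=100).fit(X, labels)
    print(clf.predict([embed("def mul(a, b): return a * b")]))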

kb-open commented 2 years ago

Thank you for the detailed answer. I'll try with GraphCodeBERT embeddings this week and UniXcoder embeddings next week (once it is released), and post the results here.

kb-open commented 2 years ago

Hi @guoday, wouldn't context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0][0,0] return the embedding for the 0th token only?

Instead, would it make more sense to do either of the two things below?

Option 1:

    context_embeddings=model(torch.tensor(tokens_ids)[None,:])[0]
    context_embeddings=torch.mean(context_embeddings, axis=1)

Option 2:

    context_embeddings=model(torch.tensor(tokens_ids)[None,:], attention_mask=torch.tensor(tokens_ids)[None,:].ne(1))[1]

guoday commented 2 years ago

The 0th token is the [CLS] token, whose embedding better represents the whole input.

Option 1 is mean pooling over the sequence. You can try it, but we found that it's worse than the [CLS] token embedding in the zero-shot setting in the UniXcoder paper.

Option 2 returns a hidden state produced by a randomly initialized pooling layer applied to the [CLS] token embedding. It can't be used to represent the whole input, since the pooling layer isn't pre-trained.
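
To make the comparison concrete, a small sketch of the three variants side by side, assuming a single unpadded input (with padded batches, mean pooling should exclude padding tokens via the attention mask):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
    model = AutoModel.from_pretrained("microsoft/graphcodebert-base")

    inputs = tokenizer("def max(a,b): if a>b: return a else return b", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    hidden = outputs[0]                     # (1, seq_len, 768): one vector per token
    cls_embedding = hidden[0, 0]            # [CLS] embedding (what the original snippet returns)
    mean_embedding = hidden[0].mean(dim=0)  # Option 1: mean pooling over all tokens
    pooler_output = outputs[1][0]           # Option 2: pooling layer on [CLS]; not pre-trained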

guoday commented 2 years ago

We have uploaded UniXcoder. Please follow this to get embeddings.
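
As a rough sketch only (the UniXcoder README in this repository documents the official wrapper and its mode prefixes; the snippet below merely assumes the microsoft/unixcoder-base checkpoint loads through the plain transformers API, like GraphCodeBERT above):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
    model = AutoModel.from_pretrained("microsoft/unixcoder-base")

    inputs = tokenizer("def max(a,b): if a>b: return a else return b", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    embedding = outputs[0][0, 0]  # first-token embedding, as in the GraphCodeBERT example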

SmitPatel910 commented 5 months ago

Hi there, I want to get statement-level embeddings of Python source code using the GraphCodeBERT model, since it helps capture the data flow as well.

Can you share a code snippet for that?