
Token Embedding from CodeBERT #15

Closed. hossein-kshvrz closed this issue 3 years ago.

hossein-kshvrz commented 3 years ago

Hi.

I want to obtain source code token embeddings, and I was wondering whether I can use the pre-trained CodeBERT model for this purpose. If so, could you give me some hints on how to do it?

guody5 commented 3 years ago

>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> tokens = tokenizer.tokenize("def max(a,b):")
['def', 'Ġmax', '(', 'a', ',', 'b', '):']
>>> tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
['<s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', '</s>']
>>> tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
[0, 9232, 19220, 1640, 102, 6, 428, 3256, 2]
>>> token_embeddings = model.embeddings.word_embeddings(torch.tensor(tokens_ids))
tensor([[ 0.1661, -0.0400,  0.0692,  ...,  0.0133,  0.0191,  0.0131],
        [-0.1560,  0.1749, -0.0762,  ..., -0.2072, -0.0840,  0.0021],
        [ 0.0572, -0.2749, -0.0944,  ...,  0.0018, -0.0174, -0.2157],
        ...,
        [ 0.2456,  0.1207, -0.0133,  ...,  0.0995, -0.0460,  0.1347],
        [-0.0049,  0.0522, -0.0181,  ...,  0.0510,  0.0292, -0.0075],
        [-0.0296, -0.0565,  0.0078,  ...,  0.1082,  0.0717, -0.0224]],
       grad_fn=<EmbeddingBackward>)
>>> context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0][0]
tensor([[-0.1740,  0.2737,  0.0452,  ..., -0.2411, -0.2950,  0.2668],
        [-1.0550, -0.1229,  0.6714,  ..., -0.5628, -0.1209,  0.4683],
        [-0.9436,  0.3294, -0.0098,  ..., -0.3375, -0.5014,  0.6879],
        ...,
        [-0.3381,  0.4317,  0.4450,  ..., -0.4600, -0.4070,  0.6626],
        [-0.3735, -0.1088,  0.6358,  ..., -0.6854, -0.0860,  0.2248],
        [-0.1740,  0.2744,  0.0457,  ..., -0.2414, -0.2962,  0.2675]],
       grad_fn=<SelectBackward>)
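
The same steps can also be written as a self-contained script. This is a minimal sketch assuming a recent transformers version, where calling the model returns an output object with last_hidden_state; calling the tokenizer directly adds the <s>/</s> special tokens and the batch dimension for you:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

# Tokenize with special tokens and a batch dimension added automatically
inputs = tokenizer("def max(a,b):", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Contextual embeddings: one 768-dim vector per token, shape (seq_len, 768)
context_embeddings = outputs.last_hidden_state[0]

# Static (non-contextual) input embeddings for the same token ids
token_embeddings = model.embeddings.word_embeddings(inputs["input_ids"][0])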
hossein-kshvrz commented 3 years ago

Thanks!

kakashiUc commented 3 years ago

@guody5 I have a question about how the transformers tokenizer behaves for CodeBERT.

Given a function snippet, for example pc and pc_space below:


from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

## actual snippet
pc="""\
import os

i = int(input())

def check():
    if(i%2==0):
        print('even')"""

## snippet with lines joined by single spaces
pc_space="""\
import os i = int(input()) def check(): if(i%2==0): print('even')"""

tokens = tokenizer.tokenize(pc)
tokens_space = tokenizer.tokenize(pc_space)

For pc and pc_space, the CodeBERT tokenizer gives two different lists of tokens. All the examples in the repo use snippets that have been flattened into space-separated lines, ignoring the original formatting. So if I want to preserve the syntax, should the function snippets be fed in their original form (like pc), or should they be preprocessed into space-separated lines (like pc_space)?
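
To make the preprocessing concrete, here is a sketch continuing from the snippet above; normalize_ws is a hypothetical helper name, not something from this repo. RoBERTa's byte-level BPE encodes newlines and indentation as their own tokens, which is why pc and pc_space tokenize differently; collapsing every whitespace run to a single space reproduces the space-separated style:

import re

def normalize_ws(snippet: str) -> str:
    # Collapse every run of whitespace (newlines, indentation, repeated
    # spaces) into a single space, matching the repo's space-separated style.
    return re.sub(r"\s+", " ", snippet).strip()

# After normalization, the two versions should tokenize identically
assert tokenizer.tokenize(normalize_ws(pc)) == tokenizer.tokenize(pc_space)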