>>> from transformers import AutoTokenizer, AutoModel
>>> import torch
>>> tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
>>> model = AutoModel.from_pretrained("microsoft/codebert-base")
>>> tokens = tokenizer.tokenize("def max(a,b):")
['def', 'Ġmax', '(', 'a', ',', 'b', '):']
>>> tokens = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
['<s>', 'def', 'Ġmax', '(', 'a', ',', 'b', '):', '</s>']
>>> tokens_ids = tokenizer.convert_tokens_to_ids(tokens)
[0, 9232, 19220, 1640, 102, 6, 428, 3256, 2]
>>> token_embeddings = model.embeddings.word_embeddings(torch.tensor(tokens_ids))
tensor([[ 0.1661, -0.0400, 0.0692, ..., 0.0133, 0.0191, 0.0131],
[-0.1560, 0.1749, -0.0762, ..., -0.2072, -0.0840, 0.0021],
[ 0.0572, -0.2749, -0.0944, ..., 0.0018, -0.0174, -0.2157],
...,
[ 0.2456, 0.1207, -0.0133, ..., 0.0995, -0.0460, 0.1347],
[-0.0049, 0.0522, -0.0181, ..., 0.0510, 0.0292, -0.0075],
[-0.0296, -0.0565, 0.0078, ..., 0.1082, 0.0717, -0.0224]],
grad_fn=<EmbeddingBackward>)
>>> context_embeddings = model(torch.tensor(tokens_ids)[None, :])[0][0]
tensor([[-0.1740, 0.2737, 0.0452, ..., -0.2411, -0.2950, 0.2668],
[-1.0550, -0.1229, 0.6714, ..., -0.5628, -0.1209, 0.4683],
[-0.9436, 0.3294, -0.0098, ..., -0.3375, -0.5014, 0.6879],
...,
[-0.3381, 0.4317, 0.4450, ..., -0.4600, -0.4070, 0.6626],
[-0.3735, -0.1088, 0.6358, ..., -0.6854, -0.0860, 0.2248],
[-0.1740, 0.2744, 0.0457, ..., -0.2414, -0.2962, 0.2675]],
grad_fn=<SelectBackward>)
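Each row of context_embeddings lines up with the corresponding entry in tokens, so the per-token contextual vectors can be read off directly. A minimal sketch, assuming the 768-dimensional hidden size of codebert-base:
>>> context_embeddings.shape
torch.Size([9, 768])
>>> token_vectors = list(zip(tokens, context_embeddings))  # (token, 768-d contextual vector) pairs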
Thanks!
@guody5 I have a question about how the tokenizer is implemented for CodeBERT.
Given a function snippet, for example pc and pc_space below:
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
## actual snippet
pc="""\
import os
i = int(input())
def check():
    if(i%2==0):
        print('even')"""
## same snippet with lines joined by spaces
pc_space="""\
import os i = int(input()) def check(): if(i%2==0): print('even')"""
tokens=tokenizer.tokenize(pc)
tokens_space=tokenizer.tokenize(pc_space)
For pc and pc_space, the CodeBERT tokenizer gives two different lists of tokens. All the examples in the repo use snippets that have been flattened to space-separated lines, ignoring the syntax (newlines and indentation). So if I want to maintain the syntax, should the example snippets be kept in their original form (like pc), or should they be preprocessed into space-separated lines (like pc_space)?
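To see the difference concretely, the two token lists can be compared directly; a small sketch building on the code above (output omitted, since the exact BPE pieces depend on the tokenizer version):
print(len(tokens), len(tokens_space))                # compare how many pieces each form produces
print([t for t in tokens if t not in tokens_space])  # pieces that appear only in the original (newline/indented) form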
Hi.
I want to obtain source code token embeddings and I was wondering whether I can use the pre-trained CodeBERT model for this purpose. If so, would you please give me some hints on how to do it?
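For reference, the steps in the reply above can be collapsed into the tokenizer's __call__ interface. This is only a sketch, not an official example from the repo, and the variable names are mine:
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")

code = "def max(a,b):"
inputs = tokenizer(code, return_tensors="pt")          # adds <s> ... </s> and returns tensors
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
context_embeddings = outputs[0][0]                     # one contextual vector per token
word_embeddings = model.embeddings.word_embeddings(inputs["input_ids"][0])  # static (non-contextual) token embeddings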