helboukkouri / character-bert

Main repository for "CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters"
Apache License 2.0
196 stars 46 forks source link

Generating word embeddings #2

Closed mlaugharn closed 3 years ago

mlaugharn commented 3 years ago

Hi, I was wondering if there was a simple way to generate word embeddings with character-bert

helboukkouri commented 3 years ago

Hi :)

Getting embeddings from CharacterBERT is very similar to what you would do with BERT. Let's take an example.

(Assuming you have already downloaded medical_character_bert)

from transformers import BertTokenizer
from modeling.character_bert import CharacterBertModel
from utils.character_cnn import CharacterIndexer

# Example text
x = "Hello World!"

# Tokenize the text
tokenizer = BertTokenizer.from_pretrained(
    './pretrained-models/bert-base-uncased/')
x = tokenizer.basic_tokenizer.tokenize(x)

# Add [CLS] and [SEP]
x = ['[CLS]', *x, '[SEP]']

# Convert token sequence into character indices
indexer = CharacterIndexer()
batch = [x]  # [x] is a batch with a single sequence x
batch_ids = indexer.as_padded_tensor(batch)

# Load some pre-trained CharacterBERT
model = CharacterBertModel.from_pretrained(
    './pretrained-models/medical_character_bert/')

# Feed batch to CharacterBERT & get the embeddings
embeddings_for_batch, _ = model(batch_ids)
embeddings_for_x = embeddings_for_batch[0]
print('These are the embeddings produces by CharacterBERT (last transformer layer)')
for token, embedding in zip(x, embeddings_for_x):
    print(token, embedding)