huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

How to generate BERT/Roberta word/sentence embedding? #2986

Closed zjplab closed 4 years ago

zjplab commented 4 years ago

I know the standard operation:

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')

input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

last_hidden_states = outputs[0]  # (batch_size, sequence_length, hidden_size), but I need a single vector per sentence

I am working on improving an RNN by incorporating embeddings from a BERT-like pretrained model. How do I get a sentence embedding in this case (one vector for the entire sentence)? Averaging, or some other transformation of last_hidden_states? Is add_special_tokens necessary? Any suggested papers to read?
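For concreteness, by averaging I mean something like the sketch below. The attention-mask weighting (so padding tokens don't contribute) and the choice between mean pooling and the first token are my assumptions, not an established recipe; which works better seems to be task-dependent.

import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-large')
model = RobertaModel.from_pretrained('roberta-large')

inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

hidden = outputs[0]                                    # (batch, seq_len, hidden_size)
mask = inputs["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)

# Mean pooling: average the token vectors, ignoring padding positions
sentence_vec = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden_size)

# Alternative: take the first token (<s> for RoBERTa, [CLS] for BERT)
cls_vec = hidden[:, 0]                                 # (batch, hidden_size)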

BramVanroy commented 4 years ago

Hi there. A few weeks or months ago, I wrote this notebook to introduce my colleagues to doing inference on LMs, in other words, how to get a sentence representation out of them. You can have a look here; it should be self-explanatory.

cformosa commented 4 years ago

Hey @zjplab, for sentence embeddings I'd recommend this library https://github.com/UKPLab/sentence-transformers along with their paper. They explain how they compute their sentence embeddings, as well as the pros and cons of several different methods of doing it. They have embeddings for BERT/RoBERTa and many more.
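A minimal usage sketch of that library; the model name below is one of their pretrained models at the time and is illustrative only, so check their docs for the current list:

from sentence_transformers import SentenceTransformer

# 'bert-base-nli-mean-tokens' is an illustrative pretrained model name;
# any model the library supports works the same way.
model = SentenceTransformer('bert-base-nli-mean-tokens')

sentences = ["Hello, my dog is cute", "Apple pie is delicious."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence
print(embeddings[0].shape)            # e.g. (768,)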

lefnire commented 4 years ago

There's also spaCy's wrapper for transformers, spacy-transformers. You can compare sentences to each other and access sentence embeddings:

examples/Spacy_Transformers_Demo.ipynb

# $ pip install spacy-transformers
# $ python -m spacy download en_trf_bertbaseuncased_lg

import spacy
nlp = spacy.load("en_trf_bertbaseuncased_lg")
apple1 = nlp("Apple shares rose on the news.")
apple2 = nlp("Apple sold fewer iPhones this quarter.")
apple3 = nlp("Apple pie is delicious.")

# sentence similarity
print(apple1.similarity(apple2)) #0.69861203
print(apple1.similarity(apple3)) #0.5404963

# sentence embeddings
apple1.vector  # or apple1.tensor.sum(axis=0)

I'm fairly confident apple1.vector is the sentence embedding, but someone will want to double-check.

[Edit] spacy-transformers currently requires transformers==2.0.0, which is pretty far behind. It also doesn't let you embed batches (you have to process one sentence at a time). I'm gonna use UKPLab/sentence-transformers, personally.

njfm0001 commented 4 years ago

> There's also spaCy's wrapper for transformers, spacy-transformers. You can compare sentences to each other and access sentence embeddings: [...]

Is there any way to compare a contextualized word embedding with a word embedding? Say I have the sentence "Apples are delicious" and I want to compare the similarity of the contextualized word "apples" against words such as "fruit" or "company". Can this be done with transformers like BERT in a way that delivers reliable numbers? Thanks in advance.
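One possible sketch, with caveats: take the hidden state of the "apples" token from the sentence, encode the comparison words on their own, and compare with cosine similarity. The model choice, encoding bare words without context, and averaging subword pieces are all assumptions here; BERT-style models aren't trained on single-word inputs, so treat the numbers with caution.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_vectors(text):
    # Return (tokens, per-token hidden states) for one piece of text.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs)[0][0]  # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return tokens, hidden

tokens, hidden = token_vectors("Apples are delicious")
# Assumes "apples" survives as a single wordpiece; if the tokenizer
# splits it, you'd average its pieces instead.
apples_vec = hidden[tokens.index("apples")]

for word in ["fruit", "company"]:
    # Encode the bare word; average its pieces, skipping [CLS]/[SEP]
    w_tokens, w_hidden = token_vectors(word)
    w_vec = w_hidden[1:-1].mean(dim=0)
    sim = torch.cosine_similarity(apples_vec, w_vec, dim=0)
    print(word, float(sim))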

kitsiosk commented 4 years ago

This one seems to do the job too: https://github.com/ashokc/Bow-to-Bert, accompanied by this blog post http://xplordat.com/2019/09/23/bow-to-bert/