google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Is BERT powerful enough to learn sentence embedding and word embedding? #261

Open xiaoming-qxm opened 5 years ago

xiaoming-qxm commented 5 years ago

After reading the BERT paper, Pre-training of Deep Bidirectional Transformers for Language Understanding, I had a fundamental question I wanted to figure out.

Based on my current understanding, I think the main contribution of BERT is learning sentence embeddings, or capturing sentence-internal structure, in an unsupervised way. When training the model, the authors said:

We use WordPiece embeddings (Wu et al.,2016) with a 30,000 token vocabulary. We denote split word pieces with ##.

It seems that the loaded word embeddings were pre-trained. However, the parameters of the word embedding layer are randomly initialized in the open-source TensorFlow BERT code. This inconsistency confused me a lot.

So my question is:

Can BERT also learn powerful word embedding representations, compared with state-of-the-art word embedding algorithms?

hanxiao commented 5 years ago

You may use bert-as-service for a quick evaluation yourself. Sentence embeddings and ELMo-like token-level embeddings are fairly easy to obtain with this service.
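For reference, a minimal client-side sketch, assuming a bert-as-service server is already running locally with a downloaded BERT checkpoint (the server command, ports, and example sentences below are placeholders, not part of this repo):

```python
# Minimal sketch: query a running bert-as-service server for sentence vectors.
# Assumes the server was started separately, e.g. with
# `bert-serving-start -model_dir /path/to/uncased_L-12_H-768_A-12`.
from bert_serving.client import BertClient

bc = BertClient()  # connects to a local server on the default ports

# One fixed-size vector per input sentence, pooled over tokens by the server.
vecs = bc.encode(['the cat sat on the mat',
                  'BERT embeddings are contextual'])
print(vecs.shape)  # e.g. (2, 768) for BERT-Base
```

Token-level (ELMo-like) embeddings are configured on the server side rather than in the client; see the bert-as-service documentation for the relevant pooling options.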

xwzhong commented 5 years ago

In my own understanding, the word embeddings are just one set of parameters, like the self-attention parameters; they are most useful when used together with the rest of the model. If you only take the trained word embeddings on their own, they may perform poorly.

imgarylai commented 5 years ago

Hello @daoliker ,

From my colleague's work: he replicated many SOTA NLP tasks and tried to replace all of the previous word representations with BERT. Most of the tasks got a significant performance improvement. He didn't try end-to-end fine-tuning on those tasks, because BERT consumes a lot of resources.

If you want to get word embeddings from BERT, I implemented a BERT embedding library that lets you get word embeddings in a programmatic way.

https://github.com/imgarylai/bert-embedding

Because I'm working closely with the MXNet & GluonNLP team, my implementation is done using MXNet and GluonNLP. However, I am trying to implement it in other frameworks as well.

Hope my work can help you.
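As a rough illustration, a usage sketch for the bert-embedding package linked above; the exact constructor arguments and return format may differ between versions, so treat this as a sketch rather than the definitive API:

```python
# Sketch of extracting per-token BERT vectors with the bert-embedding package
# (backed by MXNet/GluonNLP). Interface details may vary by version.
from bert_embedding import BertEmbedding

bert_embedding = BertEmbedding()  # downloads a pre-trained BERT model

sentences = ["BERT produces one vector per WordPiece token."]
results = bert_embedding(sentences)

# Each result is expected to pair a sentence's tokens with their vectors.
tokens, vectors = results[0]
print(tokens)
print(len(vectors), vectors[0].shape)  # one 768-dim vector per token
```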

abhinandansrivastava commented 5 years ago

Hi, after running the BERT model I get an embedding for each word in a sentence, but I need the sentence embedding. How can I get that?

I tried max-pooling over all the word embeddings, but the output is not good.

hanxiao commented 5 years ago

@abhinandansrivastava then perhaps try different pooling strategies using bert-as-service
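For concreteness, a small sketch of the pooling strategies usually compared (CLS, mean, max) applied to one sentence's token-level output; the random matrix below is only a stand-in for real BERT hidden states:

```python
import numpy as np

# Stand-in for one sentence's token-level output: (seq_len, hidden_size).
# In practice this comes from a BERT layer, with [CLS] first and [SEP] last.
token_embeddings = np.random.rand(9, 768).astype(np.float32)

cls_vec  = token_embeddings[0]                   # [CLS] vector only
mean_vec = token_embeddings[1:-1].mean(axis=0)   # average over real tokens
max_vec  = token_embeddings[1:-1].max(axis=0)    # element-wise max over tokens

print(cls_vec.shape, mean_vec.shape, max_vec.shape)  # (768,) each
```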

ChenXi1992 commented 5 years ago

@abhinandansrivastava A naive but strong sentence embedding baseline is averaging the word embeddings.

singularity014 commented 5 years ago

@ChenXi1992 I agree with averaging embeddings in general, but in the case of BERT it doesn't work very well... I have tested this.

BoPengGit commented 5 years ago

Are the WordPiece embeddings for BERT pretrained or randomly initialized when the BERT model was originally trained?

xdwang0726 commented 5 years ago

@abhinandansrivastava I think you can use the [CLS] token provided by BERT for sentence embeddings.

ksboy commented 4 years ago

@abhinandansrivastava I think you can use the [CLS] token provided by BERT for sentence embeddings.

I don't think that's a good idea for non-classification tasks. According to Transformers:

This output is usually not a good summary of the semantic content of the input, you're often better with averaging or pooling the sequence of hidden-states for the whole input sequence.

Also, from bert-as-service:

Because a pre-trained model is not fine-tuned on any downstream tasks yet. In this case, the hidden state of [CLS] is not a good sentence representation. If later you fine-tune the model, you may use [CLS] as well.

And I have verified this for a text matching task.
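To make the contrast concrete, a minimal sketch using the Hugging Face transformers library (not part of this repo) that extracts both the [CLS] hidden state and a mask-aware mean of the token hidden states; the model name and sentences are placeholders, and the exact output object depends on the transformers version:

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

batch = tokenizer(["the cat sat on the mat", "a short one"],
                  padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state        # (batch, seq_len, 768)

cls_embeddings = hidden[:, 0]                        # [CLS] hidden states

mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding
mean_embeddings = (hidden * mask).sum(1) / mask.sum(1)

print(cls_embeddings.shape, mean_embeddings.shape)   # (2, 768) each
```

As the quotes above suggest, without fine-tuning the pooled token states are usually a better sentence representation than the raw [CLS] vector.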

kirtikakde commented 4 years ago

Can we use BERT for Marathi text?

rahulkrishnan98 commented 4 years ago

@abhinandansrivastava I think you can use the [CLS] token provided by BERT for sentence embeddings.

Can you explain more about what [CLS] captures? Why is the alternative not preferred, for instance taking the embeddings of the other tokens as well and reshaping or pooling them based on the use case? More specifically, it would be great if someone could point me to what exactly [CLS] picks up from the sentence that helps it represent the sentence fully.

Chandler-Bing commented 4 years ago

Hi, thank you for this work, I think it's great. However, I have trouble getting embeddings for my texts, which are always very long (around 20k characters), and I notice that max_seq_len 512 is the maximum. Are there any methods to get embeddings for such long texts? P.S. Currently I split the text into short chunks (length less than 512), get the embeddings for all chunks, and average them, but the output is not good. How should I approach this task? I am confused.
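For reference, a rough sketch of the chunk-and-average workflow described above. The `embed` argument is a hypothetical helper standing in for whatever embedding call you already use (bert-as-service, a transformers model, etc.), and the character-based chunk size is only a crude heuristic for staying under 512 WordPiece tokens:

```python
import numpy as np

def embed_long_text(text, embed, max_chars=1500):
    """Split a long text into chunks, embed each chunk, average the vectors.

    `embed` is a hypothetical callable: list of strings -> array of shape
    (num_chunks, hidden_size).
    """
    chunks = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
    chunk_vectors = np.asarray(embed(chunks))
    return chunk_vectors.mean(axis=0)  # one document-level vector
```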

xdwang0726 commented 4 years ago

Hi, thank you for this work, I think it's great. However, I have trouble getting embeddings for my texts, which are always very long (around 20k characters), and I notice that max_seq_len 512 is the maximum. Are there any methods to get embeddings for such long texts? P.S. Currently I split the text into short chunks (length less than 512), get the embeddings for all chunks, and average them, but the output is not good. How should I approach this task? I am confused.

You could write PositionalEncoding yourself in order to customize the sequence length.
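If you do go down that road, here is a minimal sketch of the fixed sinusoidal positional encoding from the Transformer paper, generated for a longer maximum length. Note that the released BERT checkpoints use learned position embeddings limited to 512 positions, so new encodings alone will not make the pre-trained weights handle longer inputs; this only illustrates generating the encoding itself:

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """Fixed sinusoidal positional encodings ('Attention Is All You Need')."""
    positions = np.arange(max_len)[:, None]          # (max_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / float(d_model))
    angles = positions * angle_rates                 # (max_len, d_model)
    encoding = np.zeros((max_len, d_model), dtype=np.float32)
    encoding[:, 0::2] = np.sin(angles[:, 0::2])      # sine on even dimensions
    encoding[:, 1::2] = np.cos(angles[:, 1::2])      # cosine on odd dimensions
    return encoding

pe = sinusoidal_position_encoding(max_len=2048, d_model=768)
print(pe.shape)  # (2048, 768)
```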

mathshangw commented 2 years ago

Hi, after running the BERT model I get an embedding for each word in a sentence, but I need the sentence embedding. How can I get that?

I tried max-pooling over all the word embeddings, but the output is not good.

Excuse me, how did you get the embeddings for individual words, rather than a sentence, using BERT?