google-research / bert

TensorFlow code and pre-trained models for BERT
https://arxiv.org/abs/1810.04805
Apache License 2.0

Feature vectors represent word embeddings? #72

Closed astariul closed 5 years ago

astariul commented 5 years ago

I thought the feature vectors extracted from BERT represented word embeddings.

So I thought that, in order to use these embeddings, one just has to extract them (using extract_features.py), then load the weights into an Embedding layer (yes, I'm a Keras person), and then build whatever we want on top of this Embedding layer.

But that is wrong, isn't it? Using extract_features.py, I got the weights of the last 4 layers, for each word in each sentence fed as input!

So instead of having 4 * X weights (X being the size of a layer), as I expected, I have 4 * X * tokens_used_in_input_file weights!


How do I use the feature vectors to build a task-specific model architecture on top of BERT?

jacobdevlin-google commented 5 years ago

The embedding table contains context-free WordPiece embeddings. These are not particularly useful. They will just be worse versions of what you would get from GloVe/word2vec/fastText etc.

extract_features.py gives you contextual representations, which are "embeddings" of each token in the context of the sentence. This is what you would want to build a model on. For this, you need to run your full training and test data through extract_features.py and use the input vector just like you would use an embedding (to handle the 4x, you can just concatenate the 4 vectors for each word).
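As a rough illustration (not part of the repo): extract_features.py, run as described in the README with --layers=-1,-2,-3,-4, writes one JSON object per input line. The sketch below assumes that JSONL output format (keys "features", "token", "layers", "values"); the file name is just an example.

```python
import json
import numpy as np

def load_contextual_features(jsonl_path):
    """Parse the JSONL written by extract_features.py and return, for each
    input line, (tokens, matrix) where row i is the concatenation of the
    requested layers (e.g. -1,-2,-3,-4) for token i."""
    examples = []
    with open(jsonl_path, "r") as f:
        for line in f:
            record = json.loads(line)
            tokens, vectors = [], []
            for feature in record["features"]:
                tokens.append(feature["token"])
                # Concatenate the 4 layer vectors into one 4 * hidden_size vector.
                vectors.append(np.concatenate(
                    [np.asarray(layer["values"], dtype=np.float32)
                     for layer in feature["layers"]]))
            examples.append((tokens, np.stack(vectors)))
    return examples

# Hypothetical usage: "features.jsonl" is whatever was passed as --output_file.
# examples = load_contextual_features("features.jsonl")
```

Each resulting matrix has shape (num_tokens, 4 * hidden_size), i.e. 3072 values per token for BERT-Base, and can be fed to a downstream model in place of an embedding lookup.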

astariul commented 5 years ago

you need to run your full training and test data through extract_features.py and use the input vector just like you would use an embedding (to handle the 4x, you can just concatenate the 4 vectors for each word).

Oh I see.

I thought extract_features.py was a script to precompute the embeddings, which we could then use wherever we want.

But from what you said, extract_features.py IS the embedding layer.

It makes sense: having embeddings for each word independently would mean no context.
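For reference, a minimal sketch of what building on top of these features might look like in Keras (hypothetical sizes; it assumes the exported vectors have already been zero-padded to a fixed length, and it is not something from this repo):

```python
import numpy as np
from tensorflow import keras

# Hypothetical sizes: BERT-Base hidden size 768, last 4 layers concatenated,
# sequences padded/truncated to 128 tokens.
SEQ_LEN, FEAT_DIM = 128, 4 * 768

# No Embedding layer: the precomputed contextual vectors are the input.
inputs = keras.Input(shape=(SEQ_LEN, FEAT_DIM))
x = keras.layers.Bidirectional(keras.layers.LSTM(128))(inputs)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x_train: array of shape (num_examples, SEQ_LEN, FEAT_DIM) built from the
# exported features (zero-padded to SEQ_LEN); y_train: the task labels.
# model.fit(x_train, y_train, epochs=3, batch_size=32)
```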

Thank you very much for your kind and clear explanations.