keras-team / keras

Deep Learning for humans
http://keras.io/
Apache License 2.0

How does Embedding Layer work? #3110

Closed amityaffliction closed 7 years ago

amityaffliction commented 8 years ago

I'm curious about how embedding layers work.

The embedding layer is trained for an objective function that the user can specify; for example, 'mse' can be passed as an argument.

Q1: When using 'mse', which is (0.5) * (target - output)^2, what are the target and output when training the embedding layer? And what is the embedding vector that the model generates?

Here is what I understand; is it right?

For example, the input takes sentences with each word as an index (e.g. [0, 12, 4, 4, 1]) and the embedding model (an RNN) is trained to predict the next word, for example:

input 0 -> output 12, input 12 -> output 4, ... and so on, and the error metric is 'mse'.

Is that right? Where does the embedding vector come from?

I thought of it as the RNN hidden state activation, but the hidden state activation changes with context...

Q2: Is there a Keras example that can be used like Mikolov's word2vec model?

Q3: A sentence can be a variable-length sequence consisting of a different number of words. How can we train the embedding layer with variable-length sequences? Do we have to pad them with a special token? That seems inefficient.

jmhessel commented 8 years ago
  1. 'mse' can't really be passed as an argument to Embedding. 'mse' can be a loss function for a model that contains an embedding layer, however. Embedding layers map from indices to vectors, i.e. index 7 to vector [3.2,-.3,.52...]. The parameters of the layer are the lookup table values.

  2. Below is my implementation of skip-gram in keras. This has not been thoroughly tested, so use at your own discretion. You'll also need to write code to do the negative sampling, and the window generation.

  3. Check out the keras.io documentation. The API supports "masking" which is what you're talking about.

def make_word2vec_model(num_ns, embedding_dim, num_words, separate_context = True):
    '''
    num_ns: (int) number of negative samples
    embedding_dim: (int) embedding dimension
    num_words: (int) size of the vocabulary
    separate_context: (bool) whether or not to use a separate set of word embeddings
                      for the context embeddings. Recommended for text data to
                      better fit the distributional hypothesis.

    creates a skipgram with negative sampling model. word_input is a (batch_size, 1)
    shaped tensor indicating the index of the center word in the sliding window. context_input
    is a (batch_size, num_ns+1) shaped tensor, where the first column are the indices
    of the positive samples, and the num_ns following columns are the indices of the negative
    samples. Labels should be a (batch_size, num_ns+1) shaped tensor, where the first column
    is ones (positive label) and the rest of the matrix is zero (negative label)
    '''
    from keras.layers import Input, Embedding, Reshape, Merge, Flatten, Activation
    from keras.models import Model

    word_input = Input(shape = (1,), dtype = 'int32')
    context_input = Input(shape = (num_ns+1,), dtype = 'int32')

    word_embedding = Embedding(num_words, embedding_dim)
    we = word_embedding(word_input)

    if separate_context:
        context_embedding = Embedding(num_words, embedding_dim)
    else:
        context_embedding = word_embedding
    ce = Reshape((embedding_dim,num_ns+1))(context_embedding(context_input))

    dots = Flatten()(Merge(mode = 'dot', dot_axes = (1,2))([ce, we]))
    acts = Activation('sigmoid')(dots)

    model = Model(input = [word_input, context_input], output = acts)
    model.compile('adam', loss = 'binary_crossentropy')
    return model
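
For reference, here is a minimal usage sketch of the function above with dummy data (the shapes and hyperparameters are illustrative, and it assumes the same old Keras 1.x API as the snippet itself, e.g. nb_epoch):

import numpy as np

num_ns, embedding_dim, num_words = 5, 64, 10000
model = make_word2vec_model(num_ns, embedding_dim, num_words)

batch_size = 32
# center words of each sliding window, shape (batch, 1)
center_words = np.random.randint(num_words, size=(batch_size, 1))
# first column: positive context words; remaining num_ns columns: negative samples
contexts = np.random.randint(num_words, size=(batch_size, num_ns + 1))
# labels: 1 for the positive column, 0 for the negative-sample columns
labels = np.zeros((batch_size, num_ns + 1))
labels[:, 0] = 1

model.fit([center_words, contexts], labels, batch_size=batch_size, nb_epoch=1)
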
amityaffliction commented 8 years ago
  import numpy as np
  from keras.models import Sequential
  from keras.layers import Embedding

  model = Sequential()
  model.add(Embedding(1000, 64, input_length=10))
  # the model will take as input an integer matrix of size (batch, input_length).
  # the largest integer (i.e. word index) in the input should be no larger than 999 (vocabulary size).
  # now model.output_shape == (None, 10, 64), where None is the batch dimension.

  input_array = np.random.randint(1000, size=(32, 10))

  model.compile('rmsprop', 'mse')
  output_array = model.predict(input_array)
  assert output_array.shape == (32, 10, 64)

Above is the Keras Embedding layer example from the web page http://keras.io/layers/embeddings/. It passes 'mse' as an argument. I was asking how the embedding layer does the task of

  1. 'index -> vector' mapping,
  2. and what the objective function does in this process.

@jmhessel Thank you for your answer

katomaso commented 8 years ago

A quick look into the source code reveals that the Embedding layer has a set of (trainable) weights which are used as params in the backend's gather method. Assume TensorFlow as the backend for now: https://www.tensorflow.org/versions/r0.9/api_docs/python/array_ops.html#gather

As you can see, the result of Embedding is just a selection of rows from its inner trainable weights, where the selection is denoted by the "indices" in your input array.
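
To make that concrete, here is a tiny NumPy sketch of the same idea (the names are illustrative, not Keras internals): an Embedding lookup is just row selection, i.e. a gather, from a trainable weight matrix.

import numpy as np

vocab_size, embedding_dim = 1000, 64
W = np.random.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))  # the layer's weight matrix

indices = np.array([0, 12, 4, 4, 1])  # a sentence encoded as word indices
embedded = W[indices]                 # equivalent to tf.gather(W, indices)
print(embedded.shape)                 # (5, 64)
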

ahmadpgh commented 7 years ago

Hello jmhessel and katomaso,

Thanks for your helpful guide.

Since you talked here about how to deal with word embeddings, especially in the skip-gram model, I wanted to know how we can save an embedding layer in the format of a regular word embeddings file (i.e. a text file with a .txt extension). Let's assume we either learn these word embeddings in the model from scratch or we update pre-trained ones which are fed into the first layer of the current model. Is there any way to do this?

Thank you in advance.

jmhessel commented 7 years ago

If you've trained an embedding layer emb, you can get the resulting word-by-dimension matrix with my_embeddings = emb.get_weights()[0]. Then you can do normal numpy things like np.save("my_embeddings.npy", my_embeddings).
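
If you specifically want a word2vec-style .txt file, a hedged sketch along these lines should work (index_to_word is a hypothetical dict mapping row index to word string, built when you tokenized your corpus):

weights = emb.get_weights()[0]  # shape (vocab_size, embedding_dim)
with open("my_embeddings.txt", "w") as f:
    f.write("%d %d\n" % (weights.shape[0], weights.shape[1]))  # word2vec-style header line
    for idx, vec in enumerate(weights):
        word = index_to_word.get(idx, "<unk>")
        f.write(word + " " + " ".join("%.6f" % x for x in vec) + "\n")
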

ahmadpgh commented 7 years ago

Thanks jmhessel. Your previous answer was very helpful. However, I have another question. Let's assume we have built two columns of networks in Keras, and these two columns are exactly the same; they merge at the top and then feed into a dense layer, which is the output layer of the model. My question is: while the first layer of each column is an embedding layer, how can we share the weights of these corresponding layers across the two columns? Needless to say, we have a parameter (like the 'separate_context' you had in your code) set to false, meaning we only have one embedding matrix to work with. Thanks in advance.
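
For what it's worth, here is a minimal sketch of the kind of sharing I mean (assuming the Keras 2 functional API; the sizes are made up): reusing the same Embedding instance on both inputs ties the weights, whereas creating two Embedding layers would not.

from keras.layers import Input, Embedding, LSTM, Dense, concatenate
from keras.models import Model

shared_embedding = Embedding(10000, 100)  # one embedding matrix for both columns

left_in = Input(shape=(20,), dtype='int32')
right_in = Input(shape=(20,), dtype='int32')

left = LSTM(64)(shared_embedding(left_in))    # same weights...
right = LSTM(64)(shared_embedding(right_in))  # ...reused here

merged = concatenate([left, right])
output = Dense(1, activation='sigmoid')(merged)
model = Model([left_in, right_in], output)
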

Tachyon5 commented 7 years ago

I am very curious how the embedding layer is different from skip-gram or other word2vec variants. Is it the same thing or different, and if different, then how? Can I take any set of sparsely-coded categorical variables and embed them? I would love to see a paper, but I only see a link to "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" in the documentation.

ahmadpgh commented 7 years ago

Skip-gram, CBOW, and GloVe (or any other word2vec variant) are techniques that produce pre-trained word embeddings, which can be used to set the weights of an embedding layer. If the weights of this layer (generally the first layer of the network) are not initialized with these pre-trained vectors, the model/network itself assigns random weights and learns the embeddings (i.e. the weights) on the fly. Also, when the embedding layer benefits from these pre-trained word embeddings - which put the network in a proper state before training - you can decide whether you want the network to update these weights (i.e. embeddings) or keep them frozen (it is task dependent, but in general letting the network update the weights gives a better answer). In a naive way, you can consider any type of low-dimensional data (which represents high-dimensional data) as your embeddings. Keep in mind that embeddings are not limited to natural language words; recent studies have applied them to biological concepts and social graphs as well.
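
As a quick sketch of the two settings (vocab_size, dim, and embedding_matrix are placeholders for whatever you build from your GloVe/word2vec file):

from keras.layers import Embedding

# pre-trained vectors, kept frozen during training
frozen = Embedding(vocab_size, dim, weights=[embedding_matrix], trainable=False)

# pre-trained vectors, fine-tuned ("updated on the fly") during training
fine_tuned = Embedding(vocab_size, dim, weights=[embedding_matrix], trainable=True)

# no pre-trained vectors: random initialization, learned from scratch
from_scratch = Embedding(vocab_size, dim)
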

Edited and added recently: For a more in-depth understanding of word embeddings, the embedding space, their parameter settings in a network, and how an embedding layer tunes its embedding weights with respect to the given inputs, please refer to the more comprehensive explanation that I added to this thread; you can find the post close to the bottom of this page.

einareinarsson commented 7 years ago

The essence still has not been explained, namely: how does the Embedding layer figure out the appropriate weights in the proper way (i.e. weights representing semantic relationships in the vector space)?

PaulX-CN commented 7 years ago

I have a question. When I am training with question-and-answer pairs, I always need to reduce the size of my vocabulary down to, say, 10k; the remaining words become UNKNOWN in my vocab. Plus, there will be EOS and PAD tokens in the sequence to make up a fixed-length sequence. If I want to use pretrained embeddings, what is the embedding for EOS and UNK then? I know that I can mask the PAD tokens (by setting mask_zero=True in the Embedding layer) so they are not considered words, but EOS and UNK should be treated as meaningful words, am I correct? However, neither EOS nor UNK exists in any pretrained model.

ahmadpgh commented 7 years ago

You need a method to construct distributional representations for unknown tokens. There are three strategies around this:

  1. one-hot: represent the unknown token with a one-hot vector;
  2. averaged: the representation vector of the unknown token is set to the vector of the frequency-1 word that is closest to the average of the vectors of all frequency-1 words;
  3. random: the representation vector of the unknown token is set to the representation of a random frequency-1 word.

For the end-of-sentence token (usually represented as </s>), which can sometimes be important, you can either look for a better pre-trained model that includes EOS, or you can use the one-hot strategy here as well.
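
A hedged sketch of how this can look when building the embedding matrix for the Embedding layer (pretrained is a hypothetical {word: vector} dict, word_index a {word: index} dict, word_counts a {word: frequency} dict; here UNK/EOS simply get the average of the frequency-1 word vectors, a simplification of the 'averaged' strategy above):

import numpy as np

dim = 300
embedding_matrix = np.zeros((len(word_index) + 1, dim))  # row 0 stays all-zero for PAD

# average of the vectors of frequency-1 words that exist in the pre-trained model
rare_vectors = [pretrained[w] for w, c in word_counts.items() if c == 1 and w in pretrained]
unk_vector = np.mean(rare_vectors, axis=0)

for word, i in word_index.items():
    if word in pretrained:
        embedding_matrix[i] = pretrained[word]
    else:
        # covers UNK, EOS and any other token missing from the pre-trained model;
        # a random vector or a one-hot-style vector are the other strategies mentioned above
        embedding_matrix[i] = unk_vector
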

vsoto commented 7 years ago

Is there a paper/citation for how the embedding layers work in keras?

graydan commented 7 years ago

So the embedding layer updates on the fly during training, and it performs a dictionary lookup, but how? Is it an RNN? I read "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks," but it's not clear how the Keras embedding layer is implemented. Also, any help on where to start understanding the Keras source code would be great!

zzkszzks commented 7 years ago

According to Keras's official example imdb_fasttext.py, I think the embedding layer is a fastText implementation.

zuxfoucault commented 7 years ago

I'm not sure if my understanding is correct, but...

While training a seq2seq model, one of the purposes of initializing the embedding layer with a set of pre-trained fastText weights is to reduce the number of unknown words in the test environment (words that are not in the training set). Since the pre-trained fastText model has a larger vocabulary, at test time an unknown word can be represented by fastText's out-of-vocabulary word vectors, which are supposed to point in a direction similar to semantically similar words in the training set.

However, the initial fastText weights in the embedding layer will be updated through the training process (as mentioned in the previous discussion, this setting generates better results). I am wondering whether the updated embedding weights would distort the semantic-similarity relationships between words and undermine the representation of fastText's out-of-vocabulary word vectors (and also the relationship between the updated embedding weights and the word vectors that were in the initial embedding layer but whose IDs never appeared in the training data).

Would it be a better solution if the inputs were the distributed representation vectors extracted from the pre-trained model (kept fixed during training), mapped via a lookup table into the embedding layer (whose weights are updated during training)?

Any suggestions will be appreciated!

stale[bot] commented 7 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 30 days if no further activity occurs, but feel free to re-open a closed issue if needed.

fqassemi commented 6 years ago

I have worked on NLP for a while now. I understand tf-idf, and I understand word frequency. GloVe and other word2vec variants are also understood (they transform words into vectors, not into integers). Now, converting a word to an integer is still a mystery to me! If it is just word frequency, then fine (not very exciting though). I also have a problem with "learning on the fly": the embedding is a representation of our input, so the input cannot be learnt on the fly! (My point is that the optimization for the embedding layer should be done before the neural net starts training; otherwise, what are we learning?) So, if anyone could explain this with examples - just compare the output of an Embedding layer with GloVe or word2vec, and explain how a vector or a word is converted to an integer - it would be really appreciated.

ahmadpgh commented 6 years ago

GloVe and word2vec DO NOT convert a word to an integer. They change the discrete/atomic representation of the vocabulary words from one-hot encoding representations (disjoint representation of words) to low-dimensional continuous/distributed vectors that we call word embeddings (see below)!!

It may sound a bit intimidating and a mouthful at first, but once you get your head wrapped around the idea and learn how to use these embeddings, everything suddenly falls into place and you will find the whole idea quite simple. So, to understand word embeddings, please stay tuned as I try to break down the whole idea for you as follows.

These are some questions that I try to answer here:

1) What is the difference between input data and word embedding vectors? I.e., what is the relation between one-hot encoding vectors and word embedding vectors?
2) How are word embedding vectors used, and how do they get trained/updated? What is their correct setting in Keras?
3) Why is the representation of the words in the embedding space important? Why is it better to initialize a supervised network with unsupervised pre-trained word embeddings prior to training?
4) If we use an intermediate word embedding layer, what is the exact number of neurons in the first layer of the network? How about the second layer?
5) Does word embedding apply to words only, or can it refer to any useful token in a language/application?

(Important: I do NOT specifically describe how word embeddings are pre-trained using one particular unsupervised technique that you might have heard before; however, what you read here will help you picture in your mind what happens when those techniques are deployed.)

Now let's begin:

First off, don't get bogged down by the terms "word embedding vectors", "word vectors", "word embeddings" and "embedding vectors". They are just different ways of describing the same concept: low-dimensional continuous vectors for words. So you can use them interchangeably.

It is common to see these embedding vectors computed using an unsupervised technique such as GloVe or word2vec (or even other simpler techniques; for example, LSI/LSA applied to a TF-IDF matrix).

In supervised applications, these embedding vectors tend to sit on top of the input layer; this assignment sets the first weights of the supervised network. The input layer receives the one-hot vectors of the (input) words and NOT their embeddings. During training and backpropagation, the input of the network remains unchanged; only these embedding weights (and the other weights of the network) are trained. Even though in practice it is common to initialize these embedding weights with GloVe or word2vec embeddings - in an attempt to put the network in a proper state prior to training (i.e. trying to avoid falling into a poor local minimum) - these weights can also be assigned randomly and trained solely during the supervised training of the model (not the best recommendation though, unless enough training data is available).

To make this clearer, assume we have this set of vocabulary words in the language/application: {'bar', 'foo', 'baz', 'quz'}. For each of them we are going to have a one-hot vector encoded like this:

One-hot vectors:

Discrete representation, high-dimensional = vocabulary size + 1 (usually ~2K-2M):

  'OoV': [1, 0, 0, 0, 0]
  'bar': [0, 1, 0, 0, 0]
  'foo': [0, 0, 1, 0, 0]
  'baz': [0, 0, 0, 1, 0]
  'quz': [0, 0, 0, 0, 1]

Since there is a chance of seeing an unseen word during testing or in production, it is common to use a generic term for those UNKNOWN or out-of-vocabulary words; we call this generic term OoV.

Now, for each of these words we can have a word embedding (i.e. a word vector) of size 2, pre-trained or randomly initialized (2 is one of the hyperparameters of the network - typically a number between 50 and 300):

Word (embedding) vectors (trained):

Continuous representation, low-dimensional = embedding size (usually ~50-300):

  'OoV': [+0.67, -0.33]
  'bar': [+0.10, +0.25]
  'foo': [-0.53, +0.61]
  'baz': [+0.13, +0.24]
  'quz': [-0.11, -0.17]

If you look at the above vectors stacked on top of each other as they are shown, they form a matrix. This matrix is our lookup table, from which the embeddings come and in which the embeddings will be updated. As mentioned, the lookup table generally has an extra vector for OoV, initialized with zeros (or with the mean of the embeddings of low-frequency words) before training, which sits on top of all the vocabulary word embeddings; again, this vector represents out-of-vocabulary words that we might encounter in our application.

Also consider that if 'bar' and 'baz' are syntactically and/or semantically close words, we expect to see similar weights, i.e. similar word (embedding) vectors, for those two; this is the main reason for initializing the (embedding layer of the) network with pre-trained word embeddings computed with GloVe or word2vec. Using these unsupervised techniques (GloVe, word2vec, CBOW, ...) for word representation, words that are similar in the language (syntactically and/or semantically) get closer vector representations in the computed embedding space, and as words become less related their distances grow larger. Likewise, in a supervised application we expect to see the same behavior for the words, so it is better to give the network some hints in advance instead of asking it to figure out the word similarities and dissimilarities completely on its own. That is why, when we initialize an embedding layer with GloVe or word2vec word vectors, usually a minor and faster readjustment of these embeddings is all we need (regarding the embedding layer for that particular application), whereas for randomly initialized embeddings a whole new configuration of the embeddings has to be learned from scratch (and that is why we might need more data to complete this process successfully).

Now, let's imagine the input to the network is the sentence "quz foo" (you can look at it as a sequence or as a bag of words). The one-hot vectors of 'quz' and 'foo' (the inputs to the network) will be multiplied by the lookup table - which holds the weights of the first layer of the network - switching on their associated embeddings and switching off the embeddings of the words absent from the sentence. The result of this multiplication is the input to the next layer of the network (which can be fully-connected, CNN, LSTM, etc.). And during backpropagation only these weights (i.e. the embeddings) will be trained (if we allow this update, i.e. the trainable parameter of the Keras Embedding layer must be set to True).

FYI, in practice, for computational efficiency, there is no multiplication of one-hot vectors by the lookup table. Instead, each word is assigned an index equal to the row number of that word in the lookup table; knowing this index lets us retrieve the associated word vector from the lookup table much faster.
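
Here is a tiny NumPy check of that equivalence, using the trained lookup table from the example above: multiplying the one-hot vector of 'quz' by the table returns the same row that direct indexing does.

import numpy as np

W = np.array([[+0.67, -0.33],   # 'OoV'
              [+0.10, +0.25],   # 'bar'
              [-0.53, +0.61],   # 'foo'
              [+0.13, +0.24],   # 'baz'
              [-0.11, -0.17]])  # 'quz'

x = np.array([0, 0, 0, 0, 1])  # one-hot vector of 'quz'
print(x.dot(W))                # [-0.11 -0.17]
print(W[4])                    # the same row, fetched by index instead of multiplication
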

Input Layer and Embedding Layer Settings in Keras:

For a full understanding, you might want to refer to this link: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Now, in our example, if we assign indices to our list of words, we have something like this (Keras will do this for us using the Tokenizer class and the pad_sequences function):

  'OoV': 0, 'bar': 1, 'foo': 2, 'baz': 3, 'quz': 4

In short, for Input layer we need to have something like:

sequence_input = Input(shape=(MAX_WORD_SEQUENCE_LENGTH,), dtype='int32')

In our case, let's assume we consider the length of the given input to be 10 words (another hyperparameter), so we have:

given_input = Input(shape=(10,), dtype='int32')

If some sentences are longer or shorter than 10 words, we need to truncate or (zero-)pad them. Now, considering the indices, the input for the sentence "quz foo" will be: [4 2 0 0 0 0 0 0 0 0]. As mentioned above, for computational efficiency the Input layer receives the indices of the one-hot vectors instead of the one-hot vectors themselves.
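
A small sketch of how the "quz foo" example gets indexed and padded (the index mapping is written by hand here; in practice the Tokenizer builds it for you, and its index assignment may differ from this toy mapping):

from keras.preprocessing.sequence import pad_sequences

word_to_index = {'OoV': 0, 'bar': 1, 'foo': 2, 'baz': 3, 'quz': 4}
sentence = ['quz', 'foo']
indexed = [[word_to_index.get(w, 0) for w in sentence]]  # [[4, 2]]
padded = pad_sequences(indexed, maxlen=10, padding='post')
print(padded)  # [[4 2 0 0 0 0 0 0 0 0]]
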

For Embedding layer we need to have something like:

embedding_layer = Embedding(input_dim=vocabulary_size + 1,
                            output_dim=EMBEDDING_DIM,
                            weights=[embeddings_matrix],
                            input_length=MAX_WORD_SEQUENCE_LENGTH,
                            trainable=False)

To have the embedding layer in our example, we have:

embedding_layer = Embedding(input_dim=4 + 1,
                            output_dim=2,
                            weights=[lookup_table],
                            input_length=10,
                            trainable=True)

embedding_layer_receiving_the_input = embedding_layer(given_input)

So, the input_dim of the embedding layer is 5 (vocabulary_size + 1; the extra 1 is for OoV terms), which is the dimension of the one-hot vectors of the words in the input data. The embedding_dim, or output_dim, of the embedding layer is 2, which defines the feature size fed to the next layer of the network (the embedding layer's output shape per sample is (10, 2): 10 words, each mapped to a 2-dimensional embedding). Therefore, the size of the weights of the embedding layer has to be [5, 2].

If we do not set weights argument (i.e. not using pre-trained embedding vectors generated by GloVe, word2vec, etc.), the embeddings/weights would be initialized randomly; however, the size of the embedding matrix is still [5, 2].

Multiplying one of the one-hot vectors (let's say x, of size 5) by the weights or lookup table (let's say W, of size [5, 2]) gives us a vector of size 2, which is the word embedding of the word associated with that one-hot vector (shapes: (1×5)·(5×2) = (1×2)).

Considering the number of words within an instance/sentence fed to the network, input_length simply specifies how many of these index-to-embedding lookups (or, equivalently, one-hot-vector-times-lookup-table multiplications) occur per input instance, i.e. which rows of the lookup table are switched on (and which are switched off) for each word of the given input instance/sentence.

Last but not least, if you still want to picture how many neurons we actually have in the first layer of the network, this number can be calculated as (length of the input sentence/instance) × (one-hot vector size) = input_length × input_dim. In our example, this is 10 × 5 = 50 neurons, receiving 10 one-hot vectors of size 5 for each input sentence (with 10 words). However, because we use word indices instead of the one-hot vectors themselves as the real inputs to the network (the indices denote exactly where the 1s occur), and because of the special settings of the Input and Embedding layers mentioned above, we do not have to explicitly specify the number of neurons in the first layer anywhere in our code. Also be aware that this (usually large) number of neurons in the first layer is downsized to the embedding size, i.e. the embedding layer's output_dim (2 neurons in our example), for the next layer, which depending on the structure of the network can be fully-connected, CNN, RNN, LSTM, etc.
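
To make these shapes concrete, here is a hedged sketch that completes the toy example into a tiny runnable model (the random lookup table, the Dense head and the loss are arbitrary illustrative choices, not part of the explanation above):

import numpy as np
from keras.layers import Input, Embedding, Flatten, Dense
from keras.models import Model

lookup_table = np.random.uniform(-0.05, 0.05, size=(5, 2))  # (vocabulary_size + 1, embedding_dim)

given_input = Input(shape=(10,), dtype='int32')
embedded = Embedding(input_dim=5, output_dim=2,
                     weights=[lookup_table], input_length=10,
                     trainable=True)(given_input)  # output shape (None, 10, 2)
flat = Flatten()(embedded)                         # shape (None, 20)
output = Dense(1, activation='sigmoid')(flat)

model = Model(given_input, output)
model.compile('adam', 'binary_crossentropy')
model.summary()
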

As a last point, note that depending on your NLP application, the vocabulary does not have to be limited to spoken words only; it may also include punctuation, the casing of words, plural and singular forms, a generic token for NUMBERs, emoticons, perhaps a generic token for URLs mentioned in the text, and so on. If you take these extra considerations into account in your model (which may or may not affect the final result), the number of vocabulary words grows larger, since for each of these tokens (which we have called words so far) we need a "word embedding", either pre-trained or randomly initialized.

fqassemi commented 6 years ago

@ahmadpgh Thanks for your response. I agree; as I also mentioned, word2vec-type embeddings represent a word as a vector. Just to summarize what I understood: an embedding in Keras works like an extra layer at the interface between the words and the rest of the layers, where the matrix coefficients (W_ij) are given by a pre-trained model such as GloVe. This extra layer acts as an input to the NN (that is, input_dim is equal to embedding_dim, no matter the number of words in the text). If so, then the Embedding layer makes complete sense to me.

ahmadpgh commented 6 years ago

@fqassemi You are very welcome. I fully updated my answer above. In our toy example, the input_dim is 5 (i.e. 4+1) and the embedding_dim is 2.