keras-team / keras

Deep Learning for humans
http://keras.io/

Incorporating Word Vectors rather than using an Embedding class #853

Closed vindiesel closed 5 years ago

vindiesel commented 8 years ago

I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?

dandxy89 commented 8 years ago

I am running the code I linked to above on an old Dell laptop, and it's running fine. If you use the very well-prepared documentation on Keras.io and the examples, you should easily be able to do what you have described above.

You can bypass the Gensim requirement and get the model to learn its own embedding matrix as it trains. I would recommend taking a look at any of the "IMDB" examples that are available.
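For reference, a minimal sketch of that learn-it-from-scratch route, in the spirit of the IMDB examples (layer sizes and the Keras 2 argument names are illustrative assumptions, not taken from the thread):

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense
from keras.preprocessing import sequence

max_features = 20000   # vocabulary size (illustrative)
maxlen = 80            # pad/truncate every review to this length

(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)
x_train = sequence.pad_sequences(x_train, maxlen=maxlen)
x_test = sequence.pad_sequences(x_test, maxlen=maxlen)

model = Sequential()
model.add(Embedding(max_features, 128))   # embedding matrix is learned during training
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=32, epochs=3, validation_data=(x_test, y_test))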

anujgupta82 commented 8 years ago

@Sandy4321 : My code is the one provided by @dandxy89

anujgupta82 commented 8 years ago

@dandxy89 I used your code to train a "stacked LSTM" and got 0.8710. My attempt is to push the LSTM to its limits for IMDB classification. Input to an LSTM unit is a sequence; by default it doesn't return a sequence.

I replaced

model.add(input_dim)

with

model.add(LSTM(1024, return_sequences=True))  # return_sequences=True forces it to return a sequence
model.add(Dropout(0.3))
model.add(LSTM(1024))
model.add(Dropout(0.3))
model.add(Dense(1, activation='sigmoid'))

When do you know it's a good time to try a deep LSTM?

Sandy4321 commented 8 years ago

Anuj, may you share the code?


anujgupta82 commented 8 years ago

@Sandy4321 https://github.com/anujgupta82/DeepNets/tree/master/LSTM

anujgupta82 commented 8 years ago

@viksit @farizrahman4u any suggestions on how I can further improve the results? (code above)

Sandy4321 commented 8 years ago

New dropout for the embedding layer: conventional dropout leads to overfitting. See http://arxiv.org/abs/1512.05287, "A Theoretically Grounded Application of Dropout in Recurrent Neural Networks" by Yarin Gal.

Sandy4321 commented 8 years ago

Anuj, super, thanks a lot. But https://github.com/anujgupta82/DeepNets/blob/master/LSTM/IMDB_Embedding_w2v_LSTM_3.ipynb trains its own w2v model on the dataset vocab; what if one needs to use a ready-made w2v model, for example from gensim?


anujgupta82 commented 8 years ago

@Sandy4321 I have done that too: I used the pre-trained Google word2vec model "GoogleNews-vectors-negative300.bin", but got very similar results.

Will share those notebooks too.

PiranjaF commented 8 years ago

Thanks for the scripts! It is unclear to me whether unknown words should be removed from the sentence, added as a separate token in the word2vec vocabulary, or "masked" in Keras. Perhaps @dandxy89 or someone else could help explain this?

dandxy89 commented 8 years ago

If a word does not appear in the vocabulary, a number of things can be done:

The NLTK or SpaCy packages are very good for the first few points :+1:. So is the NLTK Book

It really depends what you intend to use your model for... If this is in a production environment, then I would suggest simply using an unknown label in your dictionary and keeping the masking for padding.
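A hedged sketch of that suggestion (the index layout below, with 0 reserved for padding/masking and 1 for unknown words, is an assumption for illustration, not something prescribed in the thread):

PAD_IDX, UNK_IDX = 0, 1
vocab = ["the", "movie", "was", "great"]                    # hypothetical known vocabulary
word_index = {word: i + 2 for i, word in enumerate(vocab)}  # real words start at index 2

def encode(tokens, maxlen):
    ids = [word_index.get(tok, UNK_IDX) for tok in tokens]  # unseen words -> unknown label
    ids = ids[:maxlen]
    return ids + [PAD_IDX] * (maxlen - len(ids))            # pad with the masked value 0

print(encode(["the", "movie", "was", "terrible"], maxlen=6))  # -> [2, 3, 4, 1, 0, 0]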

anujgupta82 commented 8 years ago

@dandxy89 Can we change the dataset/vocab during further training of a word2vec model? So, for example, can I take Google's pretrained model "GoogleNews-vectors-negative300.bin" and further train it on my dataset?

Why I might want to do so: to compensate for the lack of huge data (unlike the GoogleNews dataset) while fine-tuning the model to my dataset.

My understanding is that gensim does not allow this.

PiranjaF commented 8 years ago

@dandxy89 Thanks for the help! I'm curious why your script both adds 1 to n_symbols and afterwards adds 1 more when setting input_dim. Don't you only need len(vocab) + 1 (not len(vocab) + 2) to account for the 0th index?

dandxy89 commented 8 years ago

@PiranjaF Error on my part - it is intended to be used as an example only...

@anjishnu Store the vectors somewhere, since the bin file is huge, then feed them into your models. Read the documentation provided by Gensim and look at the source code for more insight into how it works...

PiranjaF commented 8 years ago

@dandxy89 No worries - it's great code. I'm probably overlooking something really simple, but where do you get the IMDB dataset from?

around1991 commented 8 years ago

from keras.datasets import imdb
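For example (the num_words cap is illustrative, and num_words is the Keras 2 name of the argument):

from keras.datasets import imdb

# Reviews come back as lists of word indices, ranked by frequency.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=20000)
word_index = imdb.get_word_index()   # maps words to their integer indices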

BrianMiner commented 8 years ago

@dandxy89 Love the code example! Can you help a novice out: what do maxlen and input_length refer to, and what is their effect?

dandxy89 commented 8 years ago

@BrianMiner

input_length / maxlen: both in my example are equivalent to one another. Their purpose is to transform each of the examples (sentences) to a fixed length: sentences, irrespective of length, will be extended or reduced to that fixed size. Those that are extended are typically padded with 0s; it can be any value, however in most cases 0 is used as the masking value.

For further information, check the documentation and run the code line by line.
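A small sketch of that fixed-length behaviour (values are illustrative; note that pad_sequences pads and truncates at the front of the sequence unless padding='post' / truncating='post' is passed):

from keras.preprocessing.sequence import pad_sequences

seqs = [[3, 7, 12], [5, 9, 2, 8, 4, 6]]   # two encoded sentences of different lengths
print(pad_sequences(seqs, maxlen=5))
# [[ 0  0  3  7 12]    <- short sentence extended with the masking value 0
#  [ 9  2  8  4  6]]   <- long sentence reduced to the fixed size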

@PiranjaF

BrianMiner commented 8 years ago

Ah, so if there are 105 words in a sentence, the last 5 are dropped?

dandxy89 commented 8 years ago

Yes. It's one of the decisions you have to make during the model-building phase, same as whether you want to include stopwords, how to deal with unseen words, pruning the vocabulary, etc.

CaiyiZhu commented 8 years ago

@farizrahman4u @viksit

""" There are 3 approaches:

Learn embedding from scratch - simply add an Embedding layer to your model
Fine-tune learned embeddings - this involves setting word2vec / GloVe vectors as your Embedding layer's weights.
Use word2vec / GloVe word vectors as inputs to your model, instead of one-hot encoding.

The third one is the best option (assuming the word vectors were obtained from the same domain as the inputs to your models; e.g., if you are doing sentiment analysis on tweets, you should use GloVe vectors trained on tweets).

In the first option, everything has to be learned from scratch. You don't need it unless you have a rare scenario. The second one is good, but your model will be unnecessarily big with all the word vectors for words that are not frequently used.

"""

Just to add a resource on which option is better: http://emnlp2014.org/papers/pdf/EMNLP2014181.pdf shows that option 2 is often better than option 3, and both are better than option 1. The paper experiments with different datasets, though it uses a CNN.
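For concreteness, a hedged sketch of option 2: seed the Embedding layer with pretrained vectors and let them be fine-tuned. Here embedding_matrix is assumed to be a (vocab_size, dim) NumPy array of word2vec/GloVe vectors with row 0 reserved for masking; depending on your Keras version the weights argument may need to be replaced by an embeddings_initializer, as discussed further down the thread.

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

vocab_size, embed_dim = 20000, 300                     # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))   # fill the rows with pretrained vectors

model = Sequential()
model.add(Embedding(input_dim=vocab_size,
                    output_dim=embed_dim,
                    weights=[embedding_matrix],        # initialise with the pretrained vectors
                    mask_zero=True,
                    trainable=True))                   # option 2: allow fine-tuning
model.add(LSTM(128))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')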

farizrahman4u commented 8 years ago

Quoting myself:

The second one is good, but, your model will be unnecessarily big with all the word vectors for words that are not frequently used.

AllardJM commented 8 years ago

I found this "issue" quite informative :) I am curious if a valid application for word embeddings like this is for search engine relevance, where we have the search term, title and description of a web page and the relevance rating (1-5). Something similar to this: https://karthkk.wordpress.com/2016/03/22/deep-learning-solution-for-netflix-prize/ but with query and title text instead of movie and user ids. I started coding this in Keras and was curious if it made sense?

dandxy89 commented 8 years ago

@AllardJM Something similar to what you have described on Reddit - like2vec

BrianMiner commented 8 years ago

I am wondering more about the embedding layer in Keras. Is there any notion of context words around each word, like this: http://deeplearning.net/tutorial/rnnslu.html#word-embeddings

Or does the embedding simply map each word to a vector to be trained?

snowxiaoru commented 8 years ago

If I want to use the third method, how can I transform my data into the correct format? My dataset is big and my word vectors are 200-dimensional; when I use pad_sequences, it gives a memory error. @farizrahman4u

viksit commented 8 years ago

@BrianMiner The embedding layer in Keras, by default, simply transforms integers (one-hot representations) into dense vectors of fixed size. For example, from the docstring:

[[4], [20]] -> [[0.25, 0.1], [0.6, -0.2]]
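A tiny sketch of that lookup, mainly to show the shapes involved (the vocabulary size and output dimension are arbitrary, and the returned values are just whatever the randomly initialised matrix contains):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(input_dim=1000, output_dim=2))   # 1000-word vocabulary, 2-d vectors
model.compile('rmsprop', 'mse')                      # only needed for predict on older Keras

ids = np.array([[4], [20]])          # shape (batch=2, sequence_length=1), integer word indices
vectors = model.predict(ids)         # shape (2, 1, 2): one dense 2-d vector per integer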
viksit commented 8 years ago

@snowxiaoru can you post your code here? It's easier to understand when you have the actual code!

ylqfp commented 8 years ago

Thanks ! It really helps me!

kgail commented 8 years ago

Are there tutorials on word2vec and vec2word with RNNs?

braingineer commented 8 years ago

To anyone who would like some code for converting embeddings to a matrix, I have some here:

https://github.com/braingineer/ikelos/blob/master/ikelos/data/embeddings.py#L66

eventually I'll write up a step, but not any time in May.

juliohm commented 8 years ago

Can someone explain why in @sergeyf's snippet there are two additions of the number 1?

First addition:

n_symbols = len(index_dict) + 1 # adding 1 to account for 0th index (for masking)

I fully understand that we have to add a masking row to the beginning of the embedding matrix.

Second addition:

embedding_weights = np.zeros((n_symbols+1,vocab_dim))

Why add 1 again? Also, the variable names are misleading: vocab_dim has nothing to do with the vocabulary; it is the dimension of the embedding vector.

If someone familiar with the Keras implementation could write organized documentation about the underlying conventions for this padding mask, that would be highly appreciated.

sergeyf commented 8 years ago

I forget what I did there but see here for a more definitive example: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py

juliohm commented 8 years ago

Thank you @sergeyf, the code in the link looks correct, just one addition. Have a nice day...

prhbrt commented 7 years ago

Skimmed all comments and did not read everything, but imho these reasons exist for fixing the embedding:

  1. Keeping each sentence as a list of word2vecs takes more memory than just the index for each word.
  2. On-the-fly transformation is a hassle, and is basically what Embedding would do for you (except that it also trains).
  3. For a large dictionary the number of trainable weights in an Embedding can be huge, wasting a lot of computational power which one might rather use for training a different part of the net (for example the actual RNN, convolutional layer or whatever). Remember, neural networks are not just about being able to do stuff, but also about being able to do it with the least computational power possible :)

Of course, you can make your own layer which does a matrix lookup for the right vector, but that's basically what Embedding does, just without the training. So I did this:

from keras.layers import Embedding

class FixedLayerMixin:
    def build(self, *args, **kwargs):
        # Build the layer as usual, then drop its weights from the trainable set.
        super(FixedLayerMixin, self).build(*args, **kwargs)
        self.trainable_weights = []

class FixedEmbedding(FixedLayerMixin, Embedding):
    pass

This mixin allows any layer to work as is, but without training. At least, I hope it does. Any notes from Keras developers on whether what I'm doing is OK are, of course, well appreciated :)
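As a side note, recent Keras versions should also let you get the same effect from the stock layer by passing trainable=False. A minimal sketch, assuming Keras 2, a hypothetical pretrained (vocab_size, dim) matrix, and a version where the weights argument is still accepted (otherwise see the embeddings_initializer discussion further down):

import numpy as np
from keras.layers import Embedding

vocab_size, embed_dim = 10000, 300                     # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))   # fill with pretrained vectors

frozen_embedding = Embedding(vocab_size, embed_dim,
                             weights=[embedding_matrix],  # load the pretrained vectors
                             trainable=False,             # exclude the weights from training
                             mask_zero=True)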

MaratZakirov commented 7 years ago

I still do not understand: can I use an Embedding layer and NOT TUNE IT AT ALL, just as a memory-saving mechanism?

prhbrt commented 7 years ago

An Embedding Layer has weights, and they need to be tuned: https://github.com/fchollet/keras/blob/master/keras/layers/embeddings.py#L95

There are, however, pretrained word2vec models, which are already trained and hence need not be retrained if they fit your needs.

Cospel commented 7 years ago

What should the padding vector be if I am using pretrained word2vec from Google?

Should I use a word like 'stop' and transform it to a vector with the Google word2vec model, or should I just use a vector of zeros?

vijay120 commented 7 years ago

What happened to the weights argument of the Embedding layer? I am following the tutorial on the blog here: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html where we pass in the pre-trained embedding matrix. However, in the latest Keras version, the weights argument has been deleted. The commit that deleted it is here: https://github.com/fchollet/keras/commit/023331ec2a7b0086abfc81eca16c84a1692ee653

MaratZakirov commented 7 years ago

@prinsherbert This approach may lead to over-fitting. For example, if you have 100,000 words (a pretty small vocabulary), all with vector size 100, and they are trainable, you get 10 million free parameters "from nothing".

prhbrt commented 7 years ago

@MaratZakirov Yes, it is wise to pretrain or just train with a large corpus such as Wikipedia to prevent overfitting. And yes, you need a high document to word ratio.

MadhumitaSushil commented 7 years ago

From what I understand, the Embedding layer in Keras performs a lookup for a word index present in an input sequence, and replaces it with the corresponding vector through an embedding matrix.

However, what I am confused about is what happens when we want to test/apply a model on unknown data. For example, if there is a word in the test document which is not present in the training vocabulary, we could compute the corresponding vector from character n-grams using a pre-trained fastText model. However, this term would not be present in the word_index that was generated while training the model, and a lookup in the embedding matrix would fail.

One possible solution can be to create a word_index from the entire dataset, including the test data, as done for this example: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py However, I would like to avoid that so that the model is applicable to unknown data.

Any suggestion for workarounds for that?

prhbrt commented 7 years ago

I think unknown words in general are mapped to random vectors, but this is not what the Embedding layer does. In text processing people often consider the vocabulary prior knowledge. And if your model is word-based, you are unlikely to learn about words not seen in the training data anyway.

Tweets are examples of data with many words that you will likely not encounter during training, but that do appear in testing, because people make typos. More importantly, they also add (hash)tags, which are typically some concatenation of words. If you search the literature, you'll notice many tweet classifiers use a character-level convolutional layer, and then some classifier on top of that (like an LSTM).

So if you want to generalize to new words, consider character-level features/classifiers.

BrianMiner commented 7 years ago

It seems like these responses ignore the main issue: was the weights argument removed in Keras 2.0?

MadhumitaSushil commented 7 years ago

@BrianMiner Yes, it seems like embedding weights have indeed been removed. My guess is they can still be set via the 'embeddings_initializer' option, by coding a custom initializer which returns the embedding weights. @vijay120
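A minimal sketch of that embeddings_initializer route (assuming Keras 2; the built-in Constant initializer is used here instead of a hand-written one, and embedding_matrix is a hypothetical pretrained (vocab_size, dim) array):

import numpy as np
from keras.initializers import Constant
from keras.layers import Embedding

vocab_size, embed_dim = 10000, 300                     # illustrative sizes
embedding_matrix = np.zeros((vocab_size, embed_dim))   # fill with pretrained vectors

embedding_layer = Embedding(vocab_size, embed_dim,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)           # freeze, or True to fine-tune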

@prinsherbert Yes, I am aware of such techniques. I was only wondering about an elegant way to add embeddings for OOVs based on character n-gram embeddings from fastText at test time, to avoid completely ignoring OOV terms in a word-based model.

monod91 commented 7 years ago

Hello @madhumita-git @prinsherbert

I also have a similar scenario, where I want to use a character-based model for another NLP task. Basically, my input data is a 3D tensor containing n sentences, each containing m words, each represented as a vector of o characters. So 1st dimension = batch, 2nd dimension = temporal dimension (max length of sentence), 3rd dimension = max characters of the word.

So my model starts with: word_input = Input((self.max_length, self.max_word_length)). Now I would like to use character embeddings at the character level (and on top of this, 1D convolution + max pooling to obtain a fixed-size vector representation of each word, similar to this paper: "Learning Character-level Representations for Part-of-Speech Tagging" http://proceedings.mlr.press/v32/santos14.pdf).

Any idea how I could use an Embedding layer in such a way in Keras?
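One way this kind of architecture is commonly wired up is to wrap a character-level sub-model in TimeDistributed. The following is only a sketch under assumptions: all sizes, the sub-model layout, and the per-word tagging head are hypothetical, not taken from the thread.

from keras.layers import Input, Embedding, Conv1D, GlobalMaxPooling1D, TimeDistributed, LSTM, Dense
from keras.models import Model

max_length, max_word_length = 50, 20     # hypothetical: words per sentence, characters per word
n_chars, char_dim, n_tags = 100, 30, 10  # hypothetical: char vocab, char embedding size, labels

# Sub-model: one word (a sequence of character ids) -> a fixed-size vector.
char_input = Input((max_word_length,))
x = Embedding(n_chars, char_dim)(char_input)
x = Conv1D(64, 3, activation='relu')(x)   # character n-gram filters
x = GlobalMaxPooling1D()(x)               # fixed-size word representation
char_encoder = Model(char_input, x)

# Apply the word encoder at every word position of the sentence.
word_input = Input((max_length, max_word_length))
words = TimeDistributed(char_encoder)(word_input)               # (batch, max_length, 64)
h = LSTM(128, return_sequences=True)(words)                     # per-word hidden states
tags = TimeDistributed(Dense(n_tags, activation='softmax'))(h)  # hypothetical per-word labels
model = Model(word_input, tags)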

Huzefa-Calcutta commented 7 years ago

If you are working with large data, it is recommended to use word vectors directly as input to the LSTM layer rather than having an Embedding layer. This avoids the matrix multiplication, which takes a lot of time when there are many sequences.
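A minimal sketch of that setup (precomputed word vectors fed straight into the LSTM, so there is no Embedding layer at all; the shapes and placeholder data are illustrative, and in practice x_train would be built offline, e.g. from a gensim KeyedVectors lookup):

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense

maxlen, vec_dim = 100, 300   # illustrative: words per sample, word-vector dimension

x_train = np.zeros((1000, maxlen, vec_dim), dtype='float32')   # one row of word vectors per token
y_train = np.zeros((1000,), dtype='float32')                   # placeholder labels

model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, vec_dim)))   # consumes sequences of dense vectors
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(x_train, y_train, batch_size=32, epochs=1)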

naisanza commented 6 years ago

@sergeyf

Hi! Is it just me or would your embedding_weights numpy zeros array be a +1 too wide?

Since you've already set the width to be n_symbols = len(index_dict) + 1 to account for 0th index

But in your embedding_weights it's embedding_weights = np.zeros((n_symbols+1,vocab_dim)), which would be the same as the original n_symbols = len(index_dict) + 2? Why the extra +1 to length?

vocab_dim = 300 # dimensionality of your word vectors
n_symbols = len(index_dict) + 1 # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols+1,vocab_dim))
for word,index in index_dict.items():
    embedding_weights[index,:] = word_vectors[word]

# assemble the model
model = Sequential() # or Graph or whatever
model.add(Embedding(output_dim=rnn_dim, input_dim=n_symbols + 1, mask_zero=True, weights=[embedding_weights])) # note you have to put embedding weights in a list by convention
model.add(LSTM(dense_dim, return_sequences=False))  
model.add(Dropout(0.5))
model.add(Dense(n_symbols, activation='softmax')) # this is the architecture for predicting the next word, but insert your own here
sergeyf commented 6 years ago

I think that +2 thing only made sense in an old version of Keras, and I absolutely can't remember why anymore. But I do remember being annoyed.

I should probably update this comment, huh? It's like documentation by now!


JakSla commented 6 years ago

Hello!

Following this: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html I tried to use pretrained word embeddings with my Embedding layer in Keras. Yet I am getting this error:

ValueError: Layer weight shape (10000, 100) not compatible with provided weight shape (88585, 100)

at this line:

model.add(Embedding(max_features, 100, input_length=max_review_length, mask_zero=True, weights=[embedding_matrix]))

From what I see, Keras 2+ does not support embedding weights (yes?). I've tried the older Keras 1.2 and 1.1.2 versions, but they still gave me the same error.

Anyone can advise whether I am doing something wrong? Or what would be the proper way to use my own embeddings in Embedding layer?

Thanks! Providing the code I am using below:

from keras.datasets import imdb
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Convolution1D, Flatten, Dropout
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.callbacks import TensorBoard
from gensim.models import word2vec
import numpy as np
import os

import keras
#Using keras to load the dataset with the top_words
max_features = 10000 #max number of words to include, words are ranked by how often they occur (in training set)
max_review_length = 1600

(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print 'loaded dataset...'
#Pad the sequence to the same length
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length)
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length)

index_dict = keras.datasets.imdb.get_word_index()

print 'loading glove...'
embeddings_index = {}
f = open(os.path.join('/home/ejaksla/PycharmProjects/MachineLearningPlayground/BachelorDegree/glove_word2vec/glove.6B/', 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print 'creating embedding matrix...'
embedding_matrix = np.zeros((len(index_dict) + 1, 100))
for word, i in index_dict.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
print('Found %s word vectors.' % len(embeddings_index))

print 'assembling model..'
# Using embedding from Keras
model = Sequential()
model.add(Embedding(max_features, 100, input_length=max_review_length,mask_zero=True, weights=[embedding_matrix]))