keras-team / keras

Deep Learning for humans
http://keras.io/

Incorporating Word Vectors rather than using an Embedding class #853

Closed. vindiesel closed this issue 5 years ago.

vindiesel commented 8 years ago

I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?

sergeyf commented 8 years ago

UPDATED NOV 13, 2017

You have to pass a weight matrix to the Embedding layer. Here is an example:

Let's say index_dict is a dictionary that maps all the words in your vocabulary to indices from 1 to n_symbols (0 is reserved for masking).

So, an example index_dict is the following:

{
 'yellow': 1,
 'four': 2,
 'woods': 3,
 'ornate': 4,
 'woody': 5,
 'cyprus': 6,
 'marching': 7,
 'canes': 8,
 'caned': 9,
 'hermann': 10,
 'lord': 11,
 'meadows': 12,
 'shaving': 13,
 'swivel': 14
...
}

And you also have a dictionary called word_vectors that maps words to vectors like so:

{
 'yellow': array([0.1,0.5,...,0.7]),
 'four': array([0.2,1.2,...,0.9]),
...
}

The following code should do what you want:

import numpy as np
from keras.layers import Embedding

# assemble the embedding_weights in one numpy array
vocab_dim = 300  # dimensionality of your word vectors
n_symbols = len(index_dict) + 1  # adding 1 to account for 0th index (reserved for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]

# define inputs here
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=True)
embedding_layer.build((None,)) # if you don't do this, the next step won't work
embedding_layer.set_weights([embedding_weights])

embedded = embedding_layer(input_layer)
# ... continue model definition here

Note that this kind of setup will result in your embeddings being trained from their initial point! If you want them fixed, then you have to set trainable=False.
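
If you do want them fixed, a minimal sketch of the frozen variant (reusing vocab_dim, n_symbols and embedding_weights from above; a recent Keras version is assumed):

# same as above, but the weights are frozen and will not be updated during training
frozen_embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=False)
frozen_embedding_layer.build((None,))  # build first, otherwise set_weights will fail
frozen_embedding_layer.set_weights([embedding_weights])

embedded = frozen_embedding_layer(input_layer)
# ... continue model definition here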

farizrahman4u commented 8 years ago

No need to skip the embedding layer. Setting word vectors as the initial weights of the embedding layer is a valid approach. The word vectors will get fine-tuned for the specific NLP task during training.
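
With the Keras API of that era this is just the weights argument of the constructor; a sketch, reusing the embedding_weights matrix built in the comment above:

embedding_layer = Embedding(input_dim=n_symbols, output_dim=vocab_dim,
                            weights=[embedding_weights])  # pre-trained vectors as the starting point, fine-tuned during training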

dandxy89 commented 8 years ago

Has anybody else attempted to embed the word vectors into a model?

I've managed to create the model, however I'm not able to achieve a worthwhile level of accuracy yet. I've used the "20 newsgroups dataset" from scikit-learn to test this model, with my own w2v vectors. The best accuracy I've achieved so far is 28%, over 5 epochs, which is not great (scikit script best - 85%). I intend to continue experimenting with the network configuration (inner dimensions and epochs initially). I do suspect that the number of dimensions is too high for such a small dataset (1000 samples).

Will update again if the results improve

viksit commented 8 years ago

From what I've seen, training your own vectors on top of a custom dataset has given me much better accuracy within that domain.

That said - any updates on this?

farizrahman4u commented 8 years ago

@viksit

There are 3 approaches:

1. Learn the embedding from scratch, jointly with the rest of the model.
2. Initialize the embedding layer with the pre-trained word vectors for the whole pre-trained vocabulary.
3. Initialize the embedding layer with pre-trained word vectors, but only for the words that occur in your data.

The third one is the best option (assuming the word vectors were obtained from the same domain as the inputs to your models; e.g., if you are doing sentiment analysis on tweets, you should use GloVe vectors trained on tweets).

In the first option, everything has to be learned from scratch. You don't need it unless you have a rare scenario. The second one is good, but your model will be unnecessarily big, carrying word vectors for words that are not frequently used.
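
A sketch of the third option, reusing the index_dict / word_vectors / vocab_dim names from the example further up; the random fallback for out-of-vocabulary words is an assumption, not something prescribed in this thread:

import numpy as np

# keep only the pre-trained vectors for words that actually occur in your data
n_symbols = len(index_dict) + 1                      # index 0 stays reserved for masking
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in index_dict.items():
    if word in word_vectors:
        embedding_weights[index, :] = word_vectors[word]                          # pre-trained vector
    else:
        embedding_weights[index, :] = np.random.uniform(-0.05, 0.05, vocab_dim)   # OOV fallback (assumption)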

viksit commented 8 years ago

@farizrahman4u agreed on those counts. The domains are a bit more specific and I have a lot more luck with option (2) than with (1) or (3) so far. An easy way to address the size problem with (2) is to prune out the vocabulary itself to the top k words.
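
A sketch of that pruning step; corpus (a list of tokenized sentences) and the value of k are assumptions for illustration:

from collections import Counter

k = 20000                                              # keep only the k most frequent words
word_counts = Counter(token for sentence in corpus for token in sentence)
index_dict = {word: i + 1                              # 0 stays reserved for masking
              for i, (word, _) in enumerate(word_counts.most_common(k))}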

XuesongYang commented 8 years ago

@farizrahman4u Thanks for sharing the ideas. I have a question on your first approach.

Learning embeddings from scratch: each word in the dictionary is represented as a one-hot vector, and then this vector is mapped to a continuous vector by the embedding layer. Is that right?

farizrahman4u commented 8 years ago

@MagicYoung Yes.
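
The layer never materializes the one-hot vectors, though; looking up the corresponding row of the weight matrix gives the same result. A tiny numpy check, with n_symbols and vocab_dim as defined earlier:

import numpy as np

W = np.random.rand(n_symbols, vocab_dim)       # the embedding matrix being learned
index = 3
one_hot = np.zeros(n_symbols)
one_hot[index] = 1.0
assert np.allclose(one_hot.dot(W), W[index])   # one-hot matmul == row lookup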

dandxy89 commented 8 years ago

To follow up on @sergeyf's suggestions... @viksit @MagicYoung @vindiesel

Take a look at my attempt (Script and Data) - it uses Gensim. This could all easily be done in Keras using the skipgram model, of course, to remove the dependency...

sergeyf commented 8 years ago

Looks cool! Are the results to your liking?

dandxy89 commented 8 years ago

Results after 2 epochs: Validation Accuracy: 0.8485 Loss: 0.3442

liyi193328 commented 8 years ago

@dandxy89 Wonderful example for me. I'll try it in my context, thanks. And if I want to feed pre-trained word2vec vectors to the LSTM directly, how do I handle different sequence lengths? What I tried: set maxlen=200 (for every sequence) and word2vec dim=600; if a sequence's length is 100, then the first 100 rows ([0:100)) hold float numbers (every row is a word vector), and rows [100:200) are padded with zeros, i.e. the remaining rows are all zeros. But after doing that I get a NaN loss, which has confused me for a long time, like issue #1360. What can I do? Thanks.

viksit commented 8 years ago

@liyi193328 are you using keras' sequence.pad_sequences(myseq, maxlen=maxlen)?

Looks like your padding is to the right of the vectors, whereas it should be to the left.

from keras.preprocessing import sequence
In [69]: sequence.pad_sequences([[1,2], [1,2,3]], maxlen=10)
Out[69]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
       [0, 0, 0, 0, 0, 0, 0, 1, 2, 3]], dtype=int32)

Secondly, you should be using categorical_crossentropy as your loss as opposed to mean_squared_error. See https://github.com/fchollet/keras/issues/321

liyi193328 commented 8 years ago

@viksit Thanks. I don't use sequence.pad_sequences; I pad zeros onto the last rows manually, because the input of pad_sequences is a 2D array, say A, where A[i,j] is the index of a word in the vocabulary and each row of A represents a sentence. But in my case every word is a 600-dim vector, not an index, so I can't use it. Is my understanding right? How can I pad zeros then? Thanks.

viksit commented 8 years ago

You need to pad before you convert the words to vectors (presumably you have a step where you have only word indexes).
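
A sketch of that order of operations, reusing the embedding_weights matrix and maxlen from earlier in the thread (the toy index sequences are just for illustration):

import numpy as np
from keras.preprocessing import sequence

index_sequences = [[1, 2, 3], [4], [2, 3]]                        # one list of word indices per sentence
padded = sequence.pad_sequences(index_sequences, maxlen=maxlen)   # 2D int array, zero-padded on the left
X = embedding_weights[padded]   # 3D float array (n_samples, maxlen, vocab_dim); index 0 maps to the all-zero row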

liyi193328 commented 8 years ago

@viksit Thanks. But how do I change a word index to a word vector? Actually I don't use word indexes, I use word vectors directly. More specifically, with three sentences like:

[ [He, like, keras], [learning], [like, keras] ]

the index 2D array is [ [1,2,3], [4], [2,3] ] ---padding---> [ [1,2,3], [0,0,4], [0,2,3] ] (the index of each word is given), and the word vectors (4-dim) are:

He -> [1,1,1,1], like -> [2,2,2,2], keras -> [3,3,3,3], learning -> [5,5,5,5]

Then after padding, the 3D array has shape (3,3,4), like:

[
[ [1,1,1,1], [2,2,2,2], [3,3,3,3] ],
[ [0,0,0,0], [0,0,0,0], [5,5,5,5] ],
[ [0,0,0,0], [2,2,2,2], [3,3,3,3] ]
]

Is this specific example right? Thanks.

viksit commented 8 years ago

That's correct.

liyi193328 commented 8 years ago

@viksit Thanks. My mistake was initializing the array with np.empty; I should have used np.zeros. Everything goes well now.

taozhijiang commented 8 years ago

@dandxy89 @farizrahman4u @viksit

Do you mean that when the Embedding layer is initialized with word2vec weights learned from another corpus, the LSTM model converges quickly?

Here is my code; just 2-3 training iterations give a high score.

https://github.com/taozhijiang/chinese_nlp/blob/master/DL_python/dl_segment_v2.py

talentlei commented 8 years ago

Thanks for sharing. I benefited a lot.

ngopee commented 8 years ago

Following up on @liyi193328's comment:

If this is my X_train: [ [ [1,1,1,1], [2,2,2,2], [3,3,3,3] ], [ [0,0,0,0], [0,0,0,0], [5,5,5,5] ], [ [0,0,0,0], [2,2,2,2], [3,3,3,3] ] ]

How should I structure my Y_train, given that each word (word-level training) will have its own tag (Y is also multi-class)? Is it like:

[ [1,2,3], [0,0,3], [0,2,1] ] ?

Because I am having an error: "Exception: All input arrays and the target array must have the same number of samples."

Thank you very much!

liyi193328 commented 8 years ago

@ngopee For multiple classes, if the label is class 2 (out of 3 classes total), then it must be transformed into the 1D array [0,0,1]. Specifically, if the sentence has x tokens and every token has a label out of y classes, then the labels (Y_train) for that sentence form a 2D array of shape (x, y). Hope that solves the problem.
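
A sketch of what that looks like for a single sentence, using to_categorical (the exact import path varies across Keras versions):

from keras.utils.np_utils import to_categorical

token_labels = [0, 2, 1]                       # one class id per token: x = 3 tokens, y = 3 classes
Y_sentence = to_categorical(token_labels, 3)   # shape (3, 3): one one-hot row per token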

ngopee commented 8 years ago

@liyi193328

Thank you very much for your reply!

Yes, I realised that later on, but I'm now having another issue which I'm not sure how to fix:

"if the sentence has x tokens and every token has a label out of y classes, then the labels (Y_train) form a 2D array of shape (x, y)". I'm not sure I follow this part. But below is what I have so far.

So, in my case, each token (word vector) has its own tag.

Here is a sample of my input:

X_train =
[
 [ 8496  1828  …5447]
 [ 9096  8895  …13890]
 [ 5775   115 … 15037]
 [ 6782  9918  …  5048]
]

Y_train=
[
array([[ 0.,  0.,  0.,  1.], [ 0.,  0.,  1.,  0.], …[ 0.,  0.,  0.,  1.]]), 
array([[ 0.,  0.,  1.,  0.], [ 0.,  0.,  0.,  1.],…[ 0.,  0.,  1.,  0.]]), 
array([[ 0.,  0.,  1.,  0.], [ 0.,  0.,  0.,  1.], …[ 0.,  0.,  1.,  0.]]), 
array([[ 0.,  1.,  0.,  0.], [ 0.,  1.,  0.,  0.], …[ 0.,  1.,  0.,  0.]])
]

I am getting this error:

AssertionError: Theano Assert failed!
Apply node that caused the error: Assert(Elemwise{Composite{(i0 - EQ(i1, i2))}}.0, Elemwise{eq,no_inplace}.0)
Inputs types: [TensorType(int8, matrix), TensorType(int8, scalar)]
Inputs shapes: [(1, 100), ()]
Inputs strides: [(100, 1), ()]
Inputs values: ['not shown', array(0, dtype=int8)]

Here is my code:

vocab_dim = 300
maxlen = 100
batch_size = 1
n_epoch = 2

print('Keras Model...')
model = Sequential()  # or Graph or whatever
model.add(Embedding(output_dim=vocab_dim,
                    input_dim=n_symbols + 1,
                    mask_zero=True,
                    weights=[embedding_weights])) 
model.add(LSTM(vocab_dim, return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1))

print('Compiling the Model...')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              class_mode='categorical')

print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
          validation_data=(X_test, y_test), show_accuracy=True)

print("Evaluate...")
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size,
                            show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)

Thank you very much!

liyi193328 commented 8 years ago

@ngopee model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1)): output_dim == 1? I think that number should be the number of classes, though I haven't read the whole context.

ngopee commented 8 years ago

@liyi193328

Thank you very much for pointing that out. That would have been yet another mistake.

However this did not seem to fix the issue I previously had. Any insight on what I could be doing wrong?

Thanks!

liyi193328 commented 8 years ago

@ngopee Sorry for the late reply! Your code describes a many-to-one classification, but your actual goal is many-to-many, so the logic needs to change.

dandxy89 commented 8 years ago

@ngopee

What are the dimensions of your output/target?

ngopee commented 8 years ago

@liyi193328 Yes, you are right. I thought the TimeDistributedDense layer was what I needed to make the model many-to-many? That is what I understood from reading other Keras issues. Could you please explain what the TimeDistributedDense layer actually does, or point me to some good reading material?

Is it possible for me to have only 1 LSTM layer and no more layers afterwards?

@dandxy89 I made a mistake above, but my output dimension is 3. I have 3 classes.

Thank you very much for your replies! I appreciate it!

liyi193328 commented 8 years ago

@ngopee https://github.com/fchollet/keras/issues/1029 may help you.

ngopee commented 8 years ago

@liyi193328

Thank you very much! I got rid of the embedding layer and I'm feeding my pre-trained word vectors to the LSTM layer. Now it works!

Thanks again for the help!

PiranjaF commented 8 years ago

Which is more efficient? Feeding the pre-trained word vectors directly or using the embedding layer with word vector weights? I would think that the two options are equivalent if you just set layer.trainable = False.

viksit commented 8 years ago

@PiranjaF are you asking in terms of performance of the model?

PiranjaF commented 8 years ago

@viksit I think that the two approaches would be identical in terms of the final solution after X iterations, but have differences in training time. Is that true and which approach would then be the fastest?

PiranjaF commented 8 years ago

So yes, I'm asking in terms of the training performance.

viksit commented 8 years ago

From my experiments, they were definitely not identical. I've seen better quality when taking global word2vec data and tuning it for a domain by using it as the starting weights of an embedding layer, but that is less performant. Letting an embedding layer learn from scratch on your own domain data is faster, but quality-wise (at least in my scenario) it suffered more.

PiranjaF commented 8 years ago

But did you also set the weights for the embedding layer using the global word2vec and then turn off the trainable attribute for the embedding layer? That should make them identical I would guess.

anujgupta82 commented 8 years ago

@dandxy89: I tried replicating your script. The best I got was Validation Accuracy: 0.7804, Loss: 0.4597, even after 30 epochs.

[attached plots: imdb_word2vec-web_lstm training and validation curves]

Any suggestions? I have blatantly copied your code and run it as-is. Nowhere close to the results you reported: Validation Accuracy: 0.8485, Loss: 0.3442.


anujgupta82 commented 8 years ago

@ngopee @farizrahman4u @viksit @liyi193328 @dandxy89

Can you pls share your piece of code where you directly fed the word-vectors to the model (without embedding layer) ?

Sandy4321 commented 8 years ago

All looks great. To sum up, could you please recommend a link to the final/best code for a Keras word2vec LSTM, ideally for short texts and sentiment analysis, but actually any code for a Keras word2vec LSTM will do. Thank you very much in advance...

dandxy89 commented 8 years ago

@anujgupta82 - I have modified my script and pushed it to my repository. The modification I made was to increase the number of iterations when building the word2vec model. Also, I never got around to directly feeding the vectors into the model instead of an embedding matrix. It should be a fairly small change to do that, and quite interesting to see the difference.

25000/25000 [==============================] - 1321s - loss: 0.2803 - acc: 0.8849 - val_loss: 0.3220 - val_acc: 0.8609
Evaluate...
25000/25000 [==============================] - 320s     
('Test score:', 0.32195980163097382)
('Test accuracy:', 0.86087999999999998)

@Sandy4321 - take a look at this. It may help.

anujgupta82 commented 8 years ago

@dandxy89

Ran your revised code, still got much lower numbers

Epoch 1/2
25000/25000 [==============================] - 224s - loss: 0.5040 - acc: 0.7463 - val_loss: 0.4709 - val_acc: 0.7625
Epoch 2/2
25000/25000 [==============================] - 224s - loss: 0.3768 - acc: 0.8274 - val_loss: 0.4571 - val_acc: 0.7810
Evaluate...
25000/25000 [==============================] - 37s
('Test score:', 0.4571321966743469)
('Test accuracy:', 0.78095999999999999)

Can someone else try his code and report the numbers you are getting? @fchollet: any reason for this? I ran the experiment a couple of times to rule out chance.

ngopee commented 8 years ago

@anujgupta82

Feeding the vectors directly:

from keras.models import Sequential
from keras.layers.core import Dropout, TimeDistributedDense
from keras.layers.recurrent import LSTM

model = Sequential()
model.add(LSTM(num_hidden_nodes, return_sequences=True, input_shape=(maxlen, vocab_dim)))
model.add(Dropout(0.2))
model.add(LSTM(num_hidden_nodes, return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributedDense(output_dim=nb_classes, input_dim=num_hidden_nodes, activation='softmax'))

dandxy89 commented 8 years ago

@anujgupta82 what backend are you using? I used Tensorflow, not that I think that will have a significant impact.

anujgupta82 commented 8 years ago

@dandxy89: I am using theano (latest version) and running my code on AWS g2.2xlarge instance. My virtual environment is as follows:

backports.ssl-match-hostname==3.4.0.2 boto==2.39.0 bson==0.4.1 bz2file==0.98 certifi==2015.9.6.2 Cython==0.23.4 funcsigs==0.4 gensim==0.12.3 graphviz==0.4.10 h5py==2.5.0 httpretty==0.8.10 Keras==0.2.0 matplotlib==1.4.3 mock==1.3.0 nltk==3.1 nose==1.3.7 numpy==1.10.1 pandas==0.17.0 pbr==1.8.1 plotly==1.8.11 pydot==1.0.2 pymongo==3.0.3 pyparsing==1.5.7 python-dateutil==2.4.2 pytz==2015.7 PyYAML==3.11 requests==2.8.1 scikit-learn==0.16.1 scipy==0.16.0 seaborn==0.6.0 six==1.10.0 sklearn==0.0 smart-open==1.3.2 Theano==0.7.0 tornado==4.2.1 wheel==0.24.0

I'd like to know what is causing the difference in results. Can you please add your virtual environment as a requirements.txt in your git repo? It will help me replicate things better. Meanwhile, can you re-run your code with my environment?

anujgupta82 commented 8 years ago

@ngopee : I will try your suggestion and get back to you soon

dandxy89 commented 8 years ago

Using sudo pip install keras gensim -U?

In the meantime I will rerun the code on Theano and let you know my result.

anujgupta82 commented 8 years ago

@dandxy89: upgraded the packages and got these results:

Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 211s - loss: 0.4418 - acc: 0.7908 - val_loss: 0.3645 - val_acc: 0.8428
Epoch 2/5
25000/25000 [==============================] - 211s - loss: 0.2836 - acc: 0.8842 - val_loss: 0.3214 - val_acc: 0.8624
Epoch 3/5
25000/25000 [==============================] - 212s - loss: 0.1818 - acc: 0.9320 - val_loss: 0.3684 - val_acc: 0.8576
Epoch 4/5
25000/25000 [==============================] - 212s - loss: 0.0992 - acc: 0.9650 - val_loss: 0.4339 - val_acc: 0.8570
Epoch 5/5
25000/25000 [==============================] - 211s - loss: 0.0580 - acc: 0.9800 - val_loss: 0.4965 - val_acc: 0.8512
Evaluate...
25000/25000 [==============================] - 57s
('Test score:', 0.49650939433217051)
('Test accuracy:', 0.85124)

Thank you so much for all the help

Sandy4321 commented 8 years ago

Anuj, may you share a link to your code, please? It would be great to try.


dandxy89 commented 8 years ago

Great! Looking at the results, @anujgupta82, the model is over-fitting after two epochs.
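
One common way to guard against that, as a side note, is early stopping on the validation loss; a sketch using the fit arguments already shown in this thread:

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', patience=1)   # stop once val_loss stops improving
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
          validation_data=(X_test, y_test), callbacks=[early_stop])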

@Sandy4321 the link I recommended above contains all the data and scripts.

Sandy4321 commented 8 years ago

Dan, I see, thanks. By the way, do you know of links to other attempts to use word2vec for sentiment analysis, maybe a pure RNN with an improved optimisation method like Rprop? Since I have only a CPU laptop, some fast-running code is needed. Thanks, Sandy
