Closed vindiesel closed 5 years ago
UPDATED NOV 13, 2017
You have to pass a weight matrix to the Embedding layer. Here is an example:
Let's say index_dict is a dictionary that maps all the words in your dictionary to indices from 1 to n_symbols (0 is reserved for the masking).
So, an example index_dict is the following:
{
'yellow': 1,
'four': 2,
'woods': 3,
'ornate': 4,
'woody': 5,
'cyprus': 6,
'marching': 7,
'canes': 8,
'caned': 9,
'hermann': 10,
'lord': 11,
'meadows': 12,
'shaving': 13,
'swivel': 14
...
}
And you also have a dictionary called word_vectors that maps words to vectors like so:
{
'yellow': array([0.1,0.5,...,0.7]),
'four': array([0.2,1.2,...,0.9]),
...
}
The following code should do what you want:
import numpy as np
from keras.layers import Embedding

# assemble the embedding_weights in one numpy array
vocab_dim = 300  # dimensionality of your word vectors
n_symbols = len(index_dict) + 1  # adding 1 to account for 0th index (for masking)
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]

# define inputs here (input_layer below is whatever integer index tensor your model takes)
embedding_layer = Embedding(output_dim=vocab_dim, input_dim=n_symbols, trainable=True)
embedding_layer.build((None,))  # if you don't do this, the next step won't work
embedding_layer.set_weights([embedding_weights])
embedded = embedding_layer(input_layer)
# ... continue model definition here
Note that this kind of setup will result in your embeddings being trained from their initial point! If you want them fixed, then you have to set trainable=False.
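For anyone who wants to check the weight-matrix assembly without installing Keras, here is a minimal NumPy-only sketch. The three words, their vectors, and the 3-dimensional size are made-up toy values, and the plain row indexing at the end only imitates what an embedding lookup does; it is not the Keras layer itself.

```python
import numpy as np

# Toy stand-ins for index_dict and word_vectors (hypothetical values).
index_dict = {"yellow": 1, "four": 2, "woods": 3}
word_vectors = {
    "yellow": np.array([0.1, 0.5, 0.7]),
    "four": np.array([0.2, 1.2, 0.9]),
    "woods": np.array([0.3, 0.4, 0.5]),
}

vocab_dim = 3                    # toy dimensionality, not the 300 used above
n_symbols = len(index_dict) + 1  # +1 because index 0 is reserved for masking

# Row i of the matrix holds the vector of the word whose index is i.
embedding_weights = np.zeros((n_symbols, vocab_dim))
for word, index in index_dict.items():
    embedding_weights[index, :] = word_vectors[word]

# A left-padded sentence of word indices; the lookup is plain row selection.
sentence = np.array([0, 0, 1, 3])
embedded = embedding_weights[sentence]  # shape: (4, vocab_dim)
```

Row 0 stays all zeros, so padded positions map to a zero vector.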
There is no need to skip the embedding layer. Setting word vectors as the initial weights of the embedding layer is a valid approach. The word vectors will get fine-tuned for the specific NLP task during training.
Has anybody else attempted to embed the word vectors into a model?
I've managed to create the model, however I'm not able to achieve a worthwhile level of accuracy yet. I've used the "20 newsgroups dataset" from scikit-learn to test this model, with my own w2v vectors. The best accuracy I've achieved so far is 28%, over 5 epochs, which is not great (scikit script best: 85%). I intend to continue experimenting with the network configuration (inner dimensions and epochs initially). I do suspect that the number of dimensions is too high for such a small dataset (1000 samples).
Will update again if the results improve
From what I've seen, training your own vectors on top of a custom dataset has given me much better accuracy within that domain.
That said - any updates on this?
@viksit
There are 3 approaches:
The third one is the best option (assuming the word vectors were obtained from the same domain as the inputs to your models; for example, if you are doing sentiment analysis on tweets, you should use GloVe vectors trained on tweets).
In the first option, everything has to be learned from scratch. You don't need it unless you have a rare scenario. The second one is good, but your model will be unnecessarily big with all the word vectors for words that are not frequently used.
@farizrahman4u agreed on those counts. The domains are a bit more specific and I have a lot more luck with option (2) than with (1) or (3) so far. An easy way to address the size problem with (2) is to prune out the vocabulary itself to the top k words.
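A rough sketch of the "prune to the top k words" idea, using only the standard library. The helper names build_index_dict and encode are made up for illustration, and dropping out-of-vocabulary words (rather than mapping them to a dedicated unknown index) is just one possible choice.

```python
from collections import Counter

def build_index_dict(tokenized_docs, k):
    """Map the k most frequent words to indices 1..k (0 stays free for masking)."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return {word: i for i, (word, _) in enumerate(counts.most_common(k), start=1)}

def encode(tokens, index_dict):
    """Turn tokens into indices, dropping words outside the pruned vocabulary."""
    return [index_dict[tok] for tok in tokens if tok in index_dict]

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
index_dict = build_index_dict(docs, k=2)
encoded = encode(["the", "dog", "sat"], index_dict)
```

With a pruned index_dict, the embedding weight matrix only needs k + 1 rows instead of one row per word in the full word2vec vocabulary.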
@farizrahman4u Thanks for sharing the ideas. I have a question on your first approach.
Learning embeddings from scratch: each word in the dictionary is represented as a one-hot vector, and then this vector is embedded as a continuous vector by the embedding layer. Is that right?
@MagicYoung Yes.
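To make the one-hot picture concrete, here is a small NumPy check that multiplying a one-hot vector by a weight matrix gives the same result as selecting the matching row. The matrix W is random toy data and the sizes are arbitrary; this illustrates the arithmetic only and says nothing about how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.default_rng(0)
n_symbols, vocab_dim = 5, 4
W = rng.normal(size=(n_symbols, vocab_dim))  # toy embedding matrix

word_index = 3
one_hot = np.zeros(n_symbols)
one_hot[word_index] = 1.0

via_matmul = one_hot @ W    # one-hot vector times the weight matrix
via_lookup = W[word_index]  # direct row lookup

same = np.allclose(via_matmul, via_lookup)
```

This is why an embedding layer can take integer indices as input: the one-hot vector never has to be materialized.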
Looks cool! Are the results to your liking?
Results after 2 epochs: Validation Accuracy: 0.8485 Loss: 0.3442
@dandxy89 Wonderful examples for me. I'll try it in my context. Thanks. And if I want to feed pre-trained word2vec vectors to an LSTM directly, how do I handle different sequence lengths? My attempt: set maxlen=200 (for every sequence) and word2vec dim=600. If a sequence's length is 100, then the first 100 rows ([0:100)) hold float numbers (every row is a word vector), and rows [100:200) are padded with zeros, which means the remaining rows are all zeros. But after doing that I get a NaN loss, which has confused me for a long time, like in issue #1360. What can I do? Thanks.
@liyi193328 are you using keras' sequence.pad_sequences(myseq, maxlen=maxlen)?
Looks like your padding is to the right of the vectors, whereas it should be to the left.
from keras.preprocessing import sequence
In [69]: sequence.pad_sequences([[1,2], [1,2,3]], maxlen=10)
Out[69]:
array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 2],
[0, 0, 0, 0, 0, 0, 0, 1, 2, 3]], dtype=int32)
Secondly, you should be using categorical_crossentropy as your loss as opposed to mean_squared_error. See https://github.com/fchollet/keras/issues/321
@viksit Thanks. I don't use sequence.pad_sequences; I pad zeros into the last rows manually. The input of pad_sequences is a 2D array, say A, where A[i,j] is the index of a word in the vocabulary and each row of A represents a sentence. But in my case every word is a 600-dim vector, not an index, so I can't use it. Is my understanding right? How can I pad with zeros then? Thanks.
You need to pad before you convert the words to vectors (presumably you have a step where you have only word indexes).
@viksit Thanks. But how do I change the word index to a word vector? Actually I don't use word indexes; I use word vectors directly. More specifically, take three sentences like:
so the index 2D array is [ [1,2,3], [4], [2,3] ] ---padding---> [ [1,2,3], [0,0,4], [0,2,3] ] (the index of each word is given). The word vectors (4 dim) are:
then after padding, the 3D array shape is (3,3,4), like:
Is this specific example right? Thanks.
That's correct.
@viksit Thanks. My mistake was initializing the array with np.empty; I should have used np.zeros. Everything goes well now.
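The exchange above can be sketched in NumPy: pad the index sequences on the left first, then look the vectors up. The pad_left helper is a hand-written stand-in (not the Keras function), and the toy weights simply give word i the vector [i, i, i, i]. Because the buffers start from np.zeros rather than np.empty, padded positions are guaranteed to be zero.

```python
import numpy as np

def pad_left(seqs, maxlen):
    """Left-pad integer sequences with 0; keep the last maxlen items if too long."""
    padded = np.zeros((len(seqs), maxlen), dtype=np.int64)
    for row, seq in enumerate(seqs):
        seq = seq[-maxlen:]
        if seq:  # an empty sequence stays all zeros
            padded[row, -len(seq):] = seq
    return padded

# Toy weights: row 0 is the zero vector for padding, word i maps to [i, i, i, i].
vocab_dim = 4
weights = np.zeros((5, vocab_dim))
for i in range(1, 5):
    weights[i, :] = i

padded = pad_left([[1, 2, 3], [4], [2, 3]], maxlen=3)
X = weights[padded]  # shape: (sentences, maxlen, vocab_dim)
```

Here X has shape (3, 3, 4), and every padded timestep is an all-zero vector.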
@dandxy89 @farizrahman4u @viksit
Do you mean that when the Embedding layer is initialized with word2vec weights learned on another corpus, the LSTM model converges quickly?
Here is my code; just 2-3 iterations give a high score.
https://github.com/taozhijiang/chinese_nlp/blob/master/DL_python/dl_segment_v2.py
Thanks for sharing. I benefited a lot.
Following up on @liyi193328's comment:
If this is my X_train: [ [ [1,1,1,1], [2,2,2,2], [3,3,3,3] ], [ [0,0,0,0], [0,0,0,0], [5,5,5,5] ], [ [0,0,0,0], [2,2,2,2], [3,3,3,3] ] ]
How should I structure my Y_train, given that each word (word-level training) will have its own tag (Y is also multi-class)? Is it like:
[ [1,2,3], [0,0,3], [0,2,1] ] ?
Because I am having an error: "Exception: All input arrays and the target array must have the same number of samples."
Thank you very much!
@ngopee For multiple classes, if the label class is 2 (3 classes in total), then it must be transformed into the 1D array [0,0,1]. Specifically, if the sentence has x tokens, every token has a label, and there are y classes, then the shape of all the labels (Y_train) is (x,y), a 2D array. Hope this solves your problem.
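As a small illustration of that label layout, here is a NumPy sketch. The helper name one_hot_labels and the example tags are made up; it assumes integer class ids starting at 0 and sentences that already have equal length.

```python
import numpy as np

def one_hot_labels(label_seqs, n_classes):
    """Turn integer class ids into one-hot rows by indexing an identity matrix."""
    return np.eye(n_classes)[np.asarray(label_seqs)]

# One sentence with x = 3 tokens and y = 3 classes gives shape (x, y).
single = one_hot_labels([0, 2, 1], n_classes=3)

# A batch of equal-length sentences gives shape (samples, x, y).
Y_train = one_hot_labels([[0, 2, 1], [1, 1, 2]], n_classes=3)
```

With this layout the number of samples in Y_train matches the first dimension of X_train, which is what the "same number of samples" error is about.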
@liyi193328
Thank you very much for your reply!
Yes, I realised that later on but now having another issue which I'm not sure how to fix:
"if every token has a label and has y classes, then all the labels's(Y_train) shape is (x,y), 2D array". I'm not sure I follow this part. But below is what I have so far.
So, in my case, each token (word vector) has its own tag.
Here is a sample of my input:
X_train =
[
[ 8496 1828 …5447]
[ 9096 8895 …13890]
[ 5775 115 … 15037]
[ 6782 9918 … 5048]
]
Y_train=
[
array([[ 0., 0., 0., 1.], [ 0., 0., 1., 0.], …[ 0., 0., 0., 1.]]),
array([[ 0., 0., 1., 0.], [ 0., 0., 0., 1.],…[ 0., 0., 1., 0.]]),
array([[ 0., 0., 1., 0.], [ 0., 0., 0., 1.], …[ 0., 0., 1., 0.]]),
array([[ 0., 1., 0., 0.], [ 0., 1., 0., 0.], …[ 0., 1., 0., 0.]])
]
I am getting this error:
AssertionError: Theano Assert failed!
Apply node that caused the error: Assert(Elemwise{Composite{(i0 - EQ(i1, i2))}}.0, Elemwise{eq,no_inplace}.0)
Inputs types: [TensorType(int8, matrix), TensorType(int8, scalar)]
Inputs shapes: [(1, 100), ()]
Inputs strides: [(100, 1), ()]
Inputs values: ['not shown', array(0, dtype=int8)]
Here is my code:
vocab_dim = 300
maxlen = 100
batch_size = 1
n_epoch = 2

print('Keras Model...')
model = Sequential()  # or Graph or whatever
model.add(Embedding(output_dim=vocab_dim,
                    input_dim=n_symbols + 1,
                    mask_zero=True,
                    weights=[embedding_weights]))
model.add(LSTM(vocab_dim, return_sequences=True))
model.add(Dropout(0.3))
model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1))

print('Compiling the Model...')
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              class_mode='categorical')

print("Train...")
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=n_epoch,
          validation_data=(X_test, y_test), show_accuracy=True)

print("Evaluate...")
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size,
                            show_accuracy=True)
print('Test score:', score)
print('Test accuracy:', acc)
Thank you very much!
@ngopee model.add(TimeDistributedDense(input_dim=vocab_dim, output_dim=1)): output_dim == 1? I think that number should be the number of classes, though I have not read the whole context.
@liyi193328
Thank you very much for pointing that out. That would have been yet another mistake.
However this did not seem to fix the issue I previously had. Any insight on what I could be doing wrong?
Thanks!
@ngopee Sorry for the late reply! Your code expresses a many-to-one classification, but your actual goal is many-to-many, so the logic needs to change.
@ngopee
What are the dimensions of your output/target?
@liyi193328 Yes, you are right. I thought the TimeDistributedDense layer was what I needed to make the model many-to-many? This is what I understood from reading other Keras issues. Could you please explain what the TimeDistributedDense layer actually is, or point me to any good reading material?
Is it possible for me to have only 1 LSTM layer and no more layer afterwards?
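On the question of what a time-distributed dense layer does, the usual idea is that the same dense weights are applied independently at every timestep, so an output of shape (batch, timesteps, hidden) becomes (batch, timesteps, classes). The NumPy sketch below illustrates that idea with random toy weights; the function name and sizes are assumptions for the example, and it is not the Keras implementation.

```python
import numpy as np

def time_distributed_dense(h, W, b):
    """Apply one dense + softmax layer to every timestep of h."""
    logits = h @ W + b                            # (batch, timesteps, classes)
    logits -= logits.max(axis=-1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch, timesteps, hidden, n_classes = 2, 5, 8, 3
h = rng.normal(size=(batch, timesteps, hidden))   # stand-in for LSTM outputs
W = rng.normal(size=(hidden, n_classes))
b = np.zeros(n_classes)

probs = time_distributed_dense(h, W, b)
```

Each timestep gets its own probability vector over the classes, which is why the last dimension has to equal the number of classes rather than 1.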
@dandxy89 I made a mistake above; my output dimension is 3. I have 3 classes.
Thank you very much for your replies! I appreciate it!
@ngopee https://github.com/fchollet/keras/issues/1029 may help you.
@liyi193328
Thank you very much! I got rid of the embedding layer and I'm feeding my pre-trained word vectors to the LSTM layer. Now it works!
Thanks again for the help!
Which is more efficient? Feeding the pre-trained word vectors directly or using the embedding layer with word vector weights? I would think that the two options are equivalent if you just set layer.trainable = False.
@PiranjaF are you asking in terms of performance of the model?
@viksit I think that the two approaches would be identical in terms of the final solution after X iterations, but have differences in training time. Is that true and which approach would then be the fastest?
So yes, I'm asking in terms of the training performance.
From my experiments, they were definitely off. I've seen better quality when taking global word2vec data and tuning it for a domain by using them as starting weights for an embedding layer. But less performant. Letting an embedding layer loose on your own domain data is faster, but quality wise (at least in my scenario), it suffered more.
But did you also set the weights for the embedding layer using the global word2vec and then turn off the trainable attribute for the embedding layer? That should make them identical I would guess.
@dandxy89: I tried replicating your script. The best I got was Validation Accuracy: 0.7804, Loss: 0.4597, even after 30 epochs.
Any suggestions? I copied your code and ran it as it is. Nowhere close to the results you reported: Validation Accuracy: 0.8485, Loss: 0.3442
@ngopee @farizrahman4u @viksit @liyi193328 @dandxy89
Can you please share the piece of code where you fed the word vectors directly to the model (without an embedding layer)?
This all looks great. To sum up, could you please recommend a link to the final, best code for Keras word2vec LSTM? Ideally for short texts and sentiment analysis, but really any code for Keras word2vec LSTM would help. Thank you very much in advance.
@anujgupta82 - I have modified my script and pushed it to my repository. The modification I made was to increase the number of iterations when building the word2vec model. Also, I never got around to directly feeding the vectors into the model instead of using an embedding matrix. It should be a fairly small change to do that, and it would be quite interesting to see the difference.
25000/25000 [==============================] - 1321s - loss: 0.2803 - acc: 0.8849 - val_loss: 0.3220 - val_acc: 0.8609
Evaluate...
25000/25000 [==============================] - 320s
('Test score:', 0.32195980163097382)
('Test accuracy:', 0.86087999999999998)
@Sandy4321 - take a look at this. It may help.
@dandxy89
Ran your revised code, still got much lower numbers
Epoch 1/2
25000/25000 [==============================] - 224s - loss: 0.5040 - acc: 0.7463 - val_loss: 0.4709 - val_acc: 0.7625
Epoch 2/2
25000/25000 [==============================] - 224s - loss: 0.3768 - acc: 0.8274 - val_loss: 0.4571 - val_acc: 0.7810
Evaluate...
25000/25000 [==============================] - 37s
('Test score:', 0.4571321966743469)
('Test accuracy:', 0.78095999999999999)
Can someone else try his code and report the numbers you are getting? @fchollet: any reasons for this? I ran the experiment a couple of times to rule out chance.
@anujgupta82
Feeding the vectors directly:
model = Sequential()
model.add(LSTM(num_hidden_nodes , return_sequences=True, input_shape=(maxlen, vocab_dim)))
model.add(Dropout(0.2))
model.add(LSTM(num_hidden_nodes , return_sequences=True))
model.add(Dropout(0.2))
model.add(TimeDistributedDense(output_dim =nb_classes,input_dim= num_hidden_nodes , activation='softmax'))
@anujgupta82 what backend are you using? I used Tensorflow, not that I think that will have a significant impact.
@dandxy89: I am using theano (latest version) and running my code on AWS g2.2xlarge instance. My virtual environment is as follows:
backports.ssl-match-hostname==3.4.0.2
boto==2.39.0
bson==0.4.1
bz2file==0.98
certifi==2015.9.6.2
Cython==0.23.4
funcsigs==0.4
gensim==0.12.3
graphviz==0.4.10
h5py==2.5.0
httpretty==0.8.10
Keras==0.2.0
matplotlib==1.4.3
mock==1.3.0
nltk==3.1
nose==1.3.7
numpy==1.10.1
pandas==0.17.0
pbr==1.8.1
plotly==1.8.11
pydot==1.0.2
pymongo==3.0.3
pyparsing==1.5.7
python-dateutil==2.4.2
pytz==2015.7
PyYAML==3.11
requests==2.8.1
scikit-learn==0.16.1
scipy==0.16.0
seaborn==0.6.0
six==1.10.0
sklearn==0.0
smart-open==1.3.2
Theano==0.7.0
tornado==4.2.1
wheel==0.24.0
I would like to know what is causing the difference in results. Can you please add your virtual environment as a requirements.txt in your git repo? It will help me replicate things better. Meanwhile, can you re-run your code once with my environment?
@ngopee : I will try your suggestion and get back to you soon
Using sudo pip install keras gensim -U?
In the meantime I will rerun the code on Theano and let you know my result.
@dandxy89 : upgraded the packages and got the results
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/5
25000/25000 [==============================] - 211s - loss: 0.4418 - acc: 0.7908 - val_loss: 0.3645 - val_acc: 0.8428
Epoch 2/5
25000/25000 [==============================] - 211s - loss: 0.2836 - acc: 0.8842 - val_loss: 0.3214 - val_acc: 0.8624
Epoch 3/5
25000/25000 [==============================] - 212s - loss: 0.1818 - acc: 0.9320 - val_loss: 0.3684 - val_acc: 0.8576
Epoch 4/5
25000/25000 [==============================] - 212s - loss: 0.0992 - acc: 0.9650 - val_loss: 0.4339 - val_acc: 0.8570
Epoch 5/5
25000/25000 [==============================] - 211s - loss: 0.0580 - acc: 0.9800 - val_loss: 0.4965 - val_acc: 0.8512
Evaluate...
25000/25000 [==============================] - 57s
('Test score:', 0.49650939433217051)
('Test accuracy:', 0.85124)
Thank you so much for all the help
Anuj, could you share a link to your code, please? It would be great to try.
Great! Looking at the results @anujgupta82 the model is over-fitting after two epochs.
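One simple way to act on that observation is to stop once validation loss stops improving. The sketch below replays the val_loss values from the 5-epoch log above through a hand-written patience rule; the function name and the patience value are illustrative choices, not something the scripts in this thread use.

```python
def early_stop_epoch(val_losses, patience=1):
    """Return (epoch training would stop at, epoch with the best val_loss)."""
    best = float("inf")
    best_epoch = 0
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# val_loss per epoch, copied from the training log above.
val_loss = [0.3645, 0.3214, 0.3684, 0.4339, 0.4965]
stopped_at, best_epoch = early_stop_epoch(val_loss, patience=1)
```

With these numbers the best validation loss is at epoch 2 and a patience of 1 would stop training at epoch 3.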
@Sandy4321 the link I recommended above contains all the data and scripts.
Dan, I see, thanks. By the way, do you know links to other attempts to use word2vec for sentiment analysis, maybe a pure RNN with improved optimisation like Rprop? I only have a CPU laptop, so some fast code is needed. Thanks, Sandy
I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?