Closed vindiesel closed 5 years ago
Is there any consensus on whether @sergeyf's approach still works? It does indeed appear that the weights argument has been removed, but it's still being used in the examples here: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py#L122
Have you run this and made sure it fails?
https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
That post says it was updated for Keras 2.
Hi everyone,
Looks like this is still a reference for some people. Here is what I do now with Keras 2.0.8:
import numpy as np
from keras.layers import Embedding

def set_embedding_layer_weights(embedding_layer, pretrained_embeddings):
    dense_dim = pretrained_embeddings.shape[1]
    # prepend an all-zeros row for the mask index 0
    weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings))
    embedding_layer.set_weights([weights])

# load up your pretrained_embeddings here
d = pretrained_embeddings.shape[1]  # should be np.array
# n_vocab has to count the extra mask row, i.e. pretrained_embeddings.shape[0] + 1
embedding_layer = Embedding(output_dim=d, input_dim=n_vocab, trainable=True)
embedding_layer.build((None,))  # if you don't do this, the next step won't work
set_embedding_layer_weights(embedding_layer, pretrained_embeddings)
Note! This version assumes that the pretrained_embeddings array does not come with a mask first row, and explicitly makes an all-zeros row for it here: weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings)). If you already have a special mask row, then feel free to just do embedding_layer.set_weights([pretrained_embeddings])
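To make the index shift concrete, here is a minimal NumPy-only sketch of the mask-row step, with no Keras involved. The helper name prepend_mask_row and the toy 3x4 matrix are made up for illustration; they are not part of the code above.

```python
import numpy as np

def prepend_mask_row(pretrained_embeddings):
    """Return a copy with an all-zeros row at index 0 (hypothetical helper)."""
    dense_dim = pretrained_embeddings.shape[1]
    return np.vstack((np.zeros(dense_dim), pretrained_embeddings))

# Toy stand-in for real pretrained vectors: 3 words, 4 dimensions.
pretrained = np.arange(12, dtype="float32").reshape(3, 4)
weights = prepend_mask_row(pretrained)

# One extra row: the Embedding layer's input_dim has to match weights.shape[0].
print(weights.shape)
# Word k of the original matrix now lives at row k + 1.
print(np.array_equal(weights[1:], pretrained))
```

The practical consequence is that word indices fed to the layer have to start at 1, since row 0 is reserved for the mask.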
Hope that helps.
Thanks for the reply both!
@AllardJM It doesn't error, no. I've done a little investigation, and with the following code I've come to the conclusion that it is being utilised and that, in fact, the original solution (using a weights kwarg) that @sergeyf posted does still work.
import numpy as np
from keras import initializers
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential
weights = np.concatenate([
    np.zeros((1, 100)),  # Masking row: all zeros.
    np.ones((1, 100))    # First word: all weights preset to 1.
])
print(weights)
array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True,
    weights = [weights],
)
model = Sequential([
    layer,
    LSTM(2, dropout = 0.2, activation = 'tanh'),
    Dense(1, activation = 'sigmoid')
])
model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = []
)
print(layer.get_weights())
[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]
The more recent solution above also appears to work, but is probably less efficient, as it initialises the weights and then overwrites them. This can be seen as follows:
layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True
)
layer.build((None,))
print(layer.get_weights())
[array([[ -2.64064074e-02, -4.05902900e-02, -1.71032399e-02, 6.36395207e-03, 4.03554030e-02, -2.91514937e-02, -3.05371974e-02, 1.60062015e-02, -4.58858572e-02, -2.71607353e-03, -6.45029533e-04, -3.60430926e-02, -4.47065122e-02, -4.46958952e-02, 8.49759020e-03, -2.07597855e-02, -4.63474654e-02, -4.47412431e-02, .....
layer.set_weights([weights])
print(layer.get_weights())
[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]
Great! This is what I found as well, the example code still works fine to pass in weights to initialize the embedding.
Cheers!
Is there standard code or a function that takes a model built in gensim word2vec and converts it into the dictionary formats from the first comment above (i.e. index_dict and word_vectors)? Otherwise I will write my own code for this, but that seems much less efficient.
Thanks!
-- So, an example index_dict is the following:
{ 'yellow': 1, 'four': 2, 'woods': 3, 'ornate': 31, 'woody': 5, 'cyprus': 6, 'marching': 7, 'canes': 8, 'caned': 9, 'hermann': 10, 'lord': 11, 'meadows': 12, 'shaving': 13, 'swivel': 14 ... }
And you also have a dictionary called word_vectors that maps words to vectors like so:
{ 'yellow': array([0.1,0.5,...,0.7]), 'four': array([0.2,1.2,...,0.9]), ... }
@aksg87 You could use the gensim.models.keyedvectors.KeyedVectors.get_keras_embedding method? The KeyedVectors instance is accessible from a Word2Vec instance via the wv attribute, for example:
model = Word2Vec.load(fname)
embedding_layer = model.wv.get_keras_embedding(train_embeddings=True)
Source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L1048
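If you do want the two dictionaries from the first comment rather than a ready-made layer, a rough sketch could look like the following. It assumes you can get an ordered list of words and a matching 2-D array of vectors out of the trained model; the attribute names for those differ between gensim versions, so check the documentation for the version you have installed. The function name build_lookup_tables and the toy data are illustrative only.

```python
import numpy as np

def build_lookup_tables(words, vectors):
    """Build index_dict (word -> 1-based index) and word_vectors (word -> vector).

    Index 0 is left free so it can serve as the mask/padding index.
    """
    index_dict = {word: i + 1 for i, word in enumerate(words)}
    word_vectors = {word: vectors[i] for i, word in enumerate(words)}
    return index_dict, word_vectors

# Toy stand-ins for a trained model's vocabulary list and vector matrix.
words = ["yellow", "four", "woods"]
vectors = np.array([[0.1, 0.5, 0.7],
                    [0.2, 1.2, 0.9],
                    [0.3, 0.4, 0.8]])

index_dict, word_vectors = build_lookup_tables(words, vectors)
print(index_dict)
print(word_vectors["four"])
```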
Thank you so much for your reply. I ended up finding some examples and wrote it out myself (code below). I'll have to try the version you provided.
Also, if you set train_embeddings=True, the weights in the layer will change from the word2vec output. Is this advisable in general?
The code I wrote to do this:
import numpy as np
from numpy import asarray, zeros
from tqdm import tqdm

# t (presumably a fitted Keras Tokenizer) and vocab_size are defined elsewhere
embeddings_index = dict()
f = open('vectors.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()
print('Loaded %s word vectors.' % len(embeddings_index))

dim_len = len(coefs)
print('Dimension of vector %s.' % dim_len)

embedding_matrix = zeros((vocab_size, dim_len))
for word, i in tqdm(t.word_index.items()):
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None and np.shape(embedding_vector) != (202,):
        embedding_matrix[i] = embedding_vector
    if np.shape(embedding_vector) == (202,):
        # debug output for vectors with the unexpected shape
        print(i)
        print("embedding_vector", np.shape(embedding_vector))
        print("embedding_matrix", np.shape(embedding_matrix[i]))
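The hard-coded (202,) check suggests that some lines in the file parse to the wrong number of values. One way to generalise it, sketched here under the assumption that the file holds one "word v1 v2 ..." entry per line, is to validate each row against the expected dimension while loading. The function name load_vectors and the in-memory sample are illustrative and not from the code above.

```python
import io
import numpy as np

def load_vectors(lines, expected_dim):
    """Parse 'word v1 v2 ...' lines, skipping rows of unexpected length."""
    embeddings_index = {}
    skipped = []  # 1-based line numbers of rejected rows
    for line_no, line in enumerate(lines, start=1):
        values = line.split()
        if not values:
            continue  # blank line
        word, coefs = values[0], values[1:]
        if len(coefs) != expected_dim:
            skipped.append(line_no)
            continue
        embeddings_index[word] = np.asarray(coefs, dtype="float32")
    return embeddings_index, skipped

# In-memory stand-in for vectors.txt; the second line has one value too few.
sample = io.StringIO("cat 0.1 0.2 0.3\ndog 0.4 0.5\nbird 0.7 0.8 0.9\n")
embeddings_index, skipped = load_vectors(sample, expected_dim=3)
print(sorted(embeddings_index))
print(skipped)
```

Collecting the skipped line numbers makes it easier to inspect the offending rows afterwards than printing shapes inside the loop.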
Another question I have: my final output is a softmax prediction over several classes (396 to be exact).
The output vector is messy (see below).
Is there a clean way to both 1) convert this into the top 3 labels predicted and 2) write a custom accuracy function which checks how often the true label is among the softmax's top 3?
array([ 2.74735111e-22, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.84925198e-38, 0.00000000e+00, 1.72161353e-34, 1.86862336e-26, 6.87889553e-07, 1.09056833e-04, 1.17705227e-26, 6.17638065e-08, 6.54662412e-23, 3.28686365e-05, 4.67332768e-08, 0.00000000e+00, 5.22176857e-10, 4.09760102e-38, 0.00000000e+00, 5.86631461e-17, 1.14025260e-08, 4.42352757e-07, 8.37238900e-08, 0.00000000e+00, 1.48040133e-14, 3.42079135e-14, 2.47516301e-20, ...
Also, if you set train_embeddings=True, the weights in the layer will change from the word2vec output. Is this advisable in general?
I don't think there's a 'correct' answer to this - it's up to you and the problem you're modelling. By having a trainable embeddings layer the weights will be tuned for the model's NLP task. This will give you more domain specific weights at the cost of increased training time.
It's quite common to train initial weights on a large corpus (or to use a pre-trained third-party model) and then use them to seed your embedding layer. In this case you will likely benefit from training the embedding layer with the model. However, if you've trained your own Word2Vec model on exactly the domain you're modelling, you may find that the difference in results is negligible and that training the layer is not worth the longer training time.
Is there a clean way to convert this into the top 3 labels predicted
To do this you could use numpy's argpartition method.
>>> predictions = np.array([0.1, 0.3, 0.2, 0.4, 0.5])
>>> top_three_classes = np.argpartition(predictions, -3)[-3:]
>>> top_three_classes
array([1, 3, 4])
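Note that argpartition only guarantees which indices end up in the top three, not their order. If you want them ranked, and want to handle a whole batch of predictions at once, something like the sketch below could work. The function name top_k_labels and the sample probabilities are made up for illustration.

```python
import numpy as np

def top_k_labels(probabilities, k=3):
    """Return the indices of the k largest entries per row, most likely first."""
    probabilities = np.atleast_2d(probabilities)
    # argpartition gathers the top k in the last k columns, in no particular order.
    top_k = np.argpartition(probabilities, -k, axis=1)[:, -k:]
    # Sort those k candidates by descending probability.
    row_idx = np.arange(probabilities.shape[0])[:, None]
    order = np.argsort(-probabilities[row_idx, top_k], axis=1)
    return top_k[row_idx, order]

predictions = np.array([[0.1, 0.3, 0.2, 0.4, 0.5],
                        [0.6, 0.05, 0.25, 0.06, 0.04]])
print(top_k_labels(predictions))
```

Partitioning first and sorting only the k survivors avoids a full sort over all 396 classes, although at that size a plain argsort would also be fine.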
Write a custom accuracy function which checks how often the softmax predicts the top 3?
Yes this should be fairly straightforward utilising the above logic and a custom metric class or function.
To calculate accuracy, I created a few functions and used map to apply them to my predictions; together they tell me how often my model's 'Top 3' prediction contains the true answer. (At the very end I basically counted 'True' vs 'False' to arrive at a percentage. I thought Keras might have a way to override its accuracy function but didn't see one.)
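For reference, the same count can be done in a few vectorised NumPy lines. This is only a sketch that assumes integer class labels and a (samples, classes) array of softmax outputs; the function name top_k_accuracy and the toy data are illustrative. As far as I know, Keras also ships a top_k_categorical_accuracy function in keras.metrics that can be wrapped with k=3 and passed in the metrics list of compile, which may be worth trying first.

```python
import numpy as np

def top_k_accuracy(y_true, y_prob, k=3):
    """Fraction of samples whose true class index is among the k highest scores."""
    y_true = np.asarray(y_true)
    # Indices of the k largest probabilities per row (unordered).
    top_k = np.argpartition(y_prob, -k, axis=1)[:, -k:]
    # A row is a hit if any of its top-k indices equals the true label.
    hits = (top_k == y_true[:, None]).any(axis=1)
    return hits.mean()

y_prob = np.array([[0.1, 0.3, 0.2, 0.4, 0.5],      # top 3: classes 1, 3, 4
                   [0.6, 0.05, 0.25, 0.06, 0.04],  # top 3: classes 0, 2, 3
                   [0.05, 0.1, 0.15, 0.3, 0.4]])   # top 3: classes 2, 3, 4
y_true = [4, 1, 2]  # hit, miss, hit
print(top_k_accuracy(y_true, y_prob))
```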
I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?
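One possible approach, which nobody in this thread has confirmed: look up the vectors yourself and feed the resulting 3-D array straight into the recurrent layer, skipping Embedding altogether. The sketch below only covers the NumPy side and assumes a word_vectors dictionary like the one described earlier; the function name sequences_to_tensor and the toy data are hypothetical.

```python
import numpy as np

def sequences_to_tensor(sentences, word_vectors, dim, max_len):
    """Turn tokenised sentences into a (batch, max_len, dim) float array.

    Unknown words and padding positions are left as all-zeros vectors.
    """
    batch = np.zeros((len(sentences), max_len, dim), dtype="float32")
    for i, tokens in enumerate(sentences):
        # Sentences longer than max_len are truncated.
        for j, token in enumerate(tokens[:max_len]):
            vector = word_vectors.get(token)
            if vector is not None:
                batch[i, j] = vector
    return batch

word_vectors = {
    "yellow": np.array([0.1, 0.5, 0.7]),
    "four": np.array([0.2, 1.2, 0.9]),
}
sentences = [["yellow", "four"], ["four", "unknown", "yellow", "four"]]
x = sequences_to_tensor(sentences, word_vectors, dim=3, max_len=3)
print(x.shape)
```

The array could then be passed to a recurrent layer whose input shape is (max_len, dim). The trade-off is that the vectors stay fixed during training, and the all-zeros padding would need something like a Masking layer in place of mask_zero.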