vindiesel commented 8 years ago

I am solving an NLP task and I am trying to model it directly as a sequence using different RNN flavors. How can I use my own Word Vectors rather than using an instance of layers.embeddings.Embedding?

DomHudson commented 6 years ago

Is there any consensus on whether @sergeyf's approach still works? It does indeed appear that the weights argument has been removed, but it's still being used in the examples here: https://github.com/fchollet/keras/blob/master/examples/pretrained_word_embeddings.py#L122

AllardJM commented 6 years ago

Have you run this and made sure it fails?

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Says updated for keras 2


sergeyf commented 6 years ago

Hi everyone,

Looks like this is still a reference for some people. Here is what I do now with Keras 2.0.8:

import numpy as np
from keras.layers import Embedding

def set_embedding_layer_weights(embedding_layer, pretrained_embeddings):
    dense_dim = pretrained_embeddings.shape[1]
    # prepend an all-zeros row for the mask index 0
    weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings))
    embedding_layer.set_weights([weights])

# load up your pretrained_embeddings here (should be an np.array)
d = pretrained_embeddings.shape[1]
# n_vocab must count the extra mask row, i.e. pretrained_embeddings.shape[0] + 1
embedding_layer = Embedding(output_dim=d, input_dim=n_vocab, trainable=True)
embedding_layer.build((None,)) # if you don't do this, the next step won't work
set_embedding_layer_weights(embedding_layer, pretrained_embeddings)

Note! This version assumes that the pretrained_embeddings array does not come with a mask first row, and explicitly makes an all-zeros row for it here: weights = np.vstack((np.zeros(dense_dim), pretrained_embeddings)). If you already have a special mask row, then feel free to just do embedding_layer.set_weights([pretrained_embeddings]).

Hope that helps.

DomHudson commented 6 years ago

Thanks for the reply both!

@AllardJM No, it doesn't error. I've done a little investigation with the following code and come to the conclusion that the weights argument is being used, and that the original solution (using a weights kwarg) that @sergeyf posted does in fact still work.

import numpy as np
from keras import initializers
from keras.layers import Embedding, LSTM, Dense
from keras.models import Sequential

weights = np.concatenate([
    np.zeros((1, 100)), # Masking row: all zeros.
    np.ones((1, 100)) # First word: all weights preset to 1.
]) 
print(weights)

array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True,
    weights = [weights],
)

model = Sequential([
    layer,
    LSTM(2, dropout = 0.2, activation = 'tanh'),
    Dense(1, activation = 'sigmoid')
])

model.compile(
    optimizer = 'adam',
    loss = 'binary_crossentropy',
    metrics = []
)

print(layer.get_weights())

[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]

The more recent solution above also appears to work, but is probably less efficient, as it initialises the weights and then overwrites them. This can be seen here:

layer = Embedding(
    output_dim = 100,
    input_dim = 2,
    mask_zero = True
)
layer.build((None,))
print(layer.get_weights())

[array([[ -2.64064074e-02, -4.05902900e-02, -1.71032399e-02, 6.36395207e-03, 4.03554030e-02, -2.91514937e-02, -3.05371974e-02, 1.60062015e-02, -4.58858572e-02, -2.71607353e-03, -6.45029533e-04, -3.60430926e-02, -4.47065122e-02, -4.46958952e-02, 8.49759020e-03, -2.07597855e-02, -4.63474654e-02, -4.47412431e-02, .....

layer.set_weights([weights])
print(layer.get_weights())

[array([[ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], [ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=float32)]

BrianMiner commented 6 years ago

Great! This is what I found as well: the example code still works fine for passing in weights to initialize the embedding.

Cheers!


aksg87 commented 6 years ago

Is there standard code or a function that takes a model built with gensim word2vec and converts it into the dictionary formats (i.e. index_dict and word_vectors) from the first comment above? Otherwise I will write my own code for this, but that seems much less efficient.

Thanks!

-- So, an example index_dict is the following:

{ 'yellow': 1, 'four': 2, 'woods': 3, 'ornate': 31, 'woody': 5, 'cyprus': 6, 'marching': 7, 'canes': 8, 'caned': 9, 'hermann': 10, 'lord': 11, 'meadows': 12, 'shaving': 13, 'swivel': 14 ... }

And you also have a dictionary called word_vectors that maps words to vectors like so:

{ 'yellow': array([0.1,0.5,...,0.7]), 'four': array([0.2,1.2,...,0.9]), ... }

DomHudson commented 6 years ago

@aksg87 You could use the gensim.models.keyedvectors.KeyedVectors.get_keras_embedding method?

The KeyedVectors instance is accessible from a Word2Vec instance via the wv attribute, for example:

model = Word2Vec.load(fname)
embedding_layer = model.wv.get_keras_embedding(train_embeddings=True)

Source: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L1048
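If you would rather build the index_dict / word_vectors structures by hand, the following is a minimal sketch. It only assumes an ordered list of words and a matching 2-D array of vectors. With gensim 4.x I believe these are exposed as model.wv.index_to_key and model.wv.vectors (and that get_keras_embedding was dropped in 4.0), but check this against your installed version. The function name build_lookup and the toy data are made up for illustration.

```python
import numpy as np


def build_lookup(words, vectors):
    """Return (index_dict, word_vectors, weights) for an Embedding layer.

    words   : ordered list of vocabulary strings
    vectors : array of shape (len(words), dim); row i belongs to words[i]
    Index 0 is reserved for the mask/padding token, so words start at 1.
    """
    vectors = np.asarray(vectors, dtype="float32")
    if vectors.ndim != 2 or vectors.shape[0] != len(words):
        raise ValueError("expected one vector row per word")

    index_dict = {word: i + 1 for i, word in enumerate(words)}
    word_vectors = {word: vectors[i] for i, word in enumerate(words)}

    # Prepend an all-zeros row so that row 0 lines up with the mask index.
    mask_row = np.zeros((1, vectors.shape[1]), dtype="float32")
    weights = np.vstack([mask_row, vectors])
    return index_dict, word_vectors, weights


# Toy stand-in for a trained model (hypothetical data, not real embeddings).
words = ["yellow", "four", "woods"]
vectors = [[0.1, 0.5], [0.2, 1.2], [0.3, 0.7]]
index_dict, word_vectors, weights = build_lookup(words, vectors)
```

The resulting weights array has len(words) + 1 rows, so the Embedding layer it is loaded into would need input_dim = len(words) + 1.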

aksg87 commented 6 years ago

Thank you so much for your reply. I ended up finding some examples and wrote it out:

I'll have to try the version you provided.

Also, if you set train_embeddings=True, the weights in the layer will change from the word2vec output. Is this advisable in general?

The code I wrote to do this:

import numpy as np
from numpy import asarray, zeros
from tqdm import tqdm

# load the whole embedding into memory
embeddings_index = dict()
with open('vectors.txt') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

dim_len = len(coefs)
print('Dimension of vector %s.' % dim_len)

# create a weight matrix for words in training docs
# (t and vocab_size are defined elsewhere; t looks like a fitted Keras Tokenizer)
embedding_matrix = zeros((vocab_size, dim_len))
for word, i in tqdm(t.word_index.items()):
    embedding_vector = embeddings_index.get(word)
    # skip vectors that came out with the unexpected shape (202,)
    if embedding_vector is not None and np.shape(embedding_vector) != (202,):
        embedding_matrix[i] = embedding_vector
    if np.shape(embedding_vector) == (202,):
        print(i)
        print("embedding_vector", np.shape(embedding_vector))
        print("embedding_matrix", np.shape(embedding_matrix[i]))
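The hard-coded (202,) check above suggests that some rows of the vectors file parsed to an unexpected length, possibly because a token contained a space. As a hedged sketch of a more general guard, the version below skips any row whose length differs from the first usable row instead of naming one bad shape. The helper names load_vectors and build_matrix and the sample data are hypothetical, and word_index is assumed to be 1-based, as Keras Tokenizer.word_index is.

```python
import io

import numpy as np


def load_vectors(lines, expected_dim=None):
    """Parse 'word v1 v2 ...' lines into {word: vector}, skipping odd rows."""
    embeddings_index = {}
    for line in lines:
        values = line.split()
        if len(values) < 2:
            continue  # blank line or a word with no coefficients
        word, coefs = values[0], values[1:]
        if expected_dim is None:
            expected_dim = len(coefs)  # trust the first usable row
        if len(coefs) != expected_dim:
            continue  # e.g. a token containing a space shifted the columns
        embeddings_index[word] = np.asarray(coefs, dtype="float32")
    return embeddings_index, expected_dim


def build_matrix(word_index, embeddings_index, dim):
    """Row i holds the vector for the word with index i; row 0 stays zero."""
    matrix = np.zeros((len(word_index) + 1, dim), dtype="float32")
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            matrix[i] = vector  # words without a vector keep an all-zeros row
    return matrix


# In-memory stand-in for vectors.txt; the second row is deliberately malformed.
sample = io.StringIO("cat 0.1 0.2\nnew york 0.3 0.4\ndog 0.5 0.6\n")
embeddings_index, dim = load_vectors(sample)
word_index = {"dog": 1, "cat": 2, "bird": 3}
embedding_matrix = build_matrix(word_index, embeddings_index, dim)
```

If the very first row of a real file might be malformed, passing expected_dim explicitly is the safer choice.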

aksg87 commented 6 years ago

Another question: my final output is a softmax prediction over several classes (396 to be exact).

The output vector is messy (see below).

Is there a clean way to both 1) convert this into the top 3 predicted labels and 2) write a custom accuracy function which checks how often the true label is among the softmax's top 3?

array([ 2.74735111e-22, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.84925198e-38, 0.00000000e+00, 1.72161353e-34, 1.86862336e-26, 6.87889553e-07, 1.09056833e-04, 1.17705227e-26, 6.17638065e-08, 6.54662412e-23, 3.28686365e-05, 4.67332768e-08, 0.00000000e+00, 5.22176857e-10, 4.09760102e-38, 0.00000000e+00, 5.86631461e-17, 1.14025260e-08, 4.42352757e-07, 8.37238900e-08, 0.00000000e+00, 1.48040133e-14, 3.42079135e-14, 2.47516301e-20, ...

DomHudson commented 6 years ago

Also, if you set train_embeddings=True, the weights in the layer will change from the word2vec output. Is this advisable in general?

I don't think there's a 'correct' answer to this - it's up to you and the problem you're modelling. By having a trainable embeddings layer the weights will be tuned for the model's NLP task. This will give you more domain specific weights at the cost of increased training time.

It's quite common to train initial weights on a large corpus (or to use a pre-trained third-party model) and then use that to seed your embedding layer. In this case you will likely benefit from training the embeddings layer with the model. However, if you've trained your own Word2Vec model on exactly the domain you're modelling, you may find that the difference in results is negligible and that training the layer is not worth the longer training time.

Is there a clean way to convert this into the top 3 labels predicted

To do this you could use numpy's argpartition function. Note that the indices it returns are the top three but are not guaranteed to be sorted by score.

>>> predictions = np.array([0.1, 0.3, 0.2, 0.4, 0.5])
>>> top_three_classes = np.argpartition(predictions, -3)[-3:]
>>> top_three_classes
array([1, 3, 4])
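If the labels are needed in order of confidence, one option is to sort only the k candidates that argpartition returns. The following is a small sketch of that idea; the helper name top_k_labels is made up for illustration.

```python
import numpy as np


def top_k_labels(predictions, k=3):
    """Return the indices of the k largest scores, highest score first."""
    predictions = np.asarray(predictions)
    # argpartition finds the k largest in linear time but leaves them unordered.
    candidates = np.argpartition(predictions, -k)[-k:]
    # Sort just those k candidates by descending score.
    order = np.argsort(predictions[candidates])[::-1]
    return candidates[order]


predictions = np.array([0.1, 0.3, 0.2, 0.4, 0.5])
top_three = top_k_labels(predictions)  # indices 4, 3, 1 in that order
```

For 396 classes a plain np.argsort over the whole vector would also be fast enough; the partition step mainly pays off for much larger outputs.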

Write a custom accuracy function which checks how often the softmax predicts the top 3?

Yes, this should be fairly straightforward using the above logic and a custom metric class or function.
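As a rough sketch of what such a check could look like outside the training loop, the NumPy function below counts how often the true class id is among the k highest scores of each row. The name top_k_accuracy and the toy arrays are hypothetical, and it assumes integer class ids rather than one-hot targets.

```python
import numpy as np


def top_k_accuracy(y_true, y_prob, k=3):
    """Fraction of samples whose true class is among the k highest scores.

    y_true : integer class ids, shape (n_samples,)
    y_prob : softmax outputs, shape (n_samples, n_classes)
    """
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Indices of the k largest scores in every row (order within k is arbitrary).
    top_k = np.argpartition(y_prob, -k, axis=1)[:, -k:]
    hits = (top_k == y_true[:, None]).any(axis=1)
    return float(hits.mean())


y_prob = np.array([
    [0.10, 0.30, 0.20, 0.35, 0.05],  # top 3: classes 3, 1, 2
    [0.50, 0.05, 0.05, 0.30, 0.10],  # top 3: classes 0, 3, 4
])
y_true = np.array([2, 1])  # first sample is a hit, second is a miss
score = top_k_accuracy(y_true, y_prob, k=3)
```

For a metric reported during training, Keras 2 also ships keras.metrics.top_k_categorical_accuracy, which I believe can be wrapped in a small function that fixes k=3 and passed through the metrics argument of model.compile.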

aksg87 commented 6 years ago

Thanks so much for your reply! I discovered the np.argpartition function soon after my post and it worked perfectly.

To calculate accuracy, I created a few functions and used map to apply them to my predictions, which essentially tells me how often my model's 'Top 3' prediction contains the true answer. (At the very end I basically counted 'True' vs 'False' to arrive at a percentage. I thought Keras might have a way to override its accuracy function but didn't see one.)

I am now incorporating multiple inputs into the model (but will apply aggressive dropout to all of them, so the model should work even with only some of them). What I would love is for the model to assume a blank input, or perhaps a vector of zeros, if no input is provided during model.predict, instead of throwing an error. Is there any way to do this, or should I hardcode a vector of zeros if no input is provided? -- Thanks again for all the awesome feedback.
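One way to approach this is to fill in the zeros just before calling model.predict instead of changing the model. The sketch below does that for a dict of named inputs. The helper fill_missing_inputs and the input names and shapes are hypothetical, and it assumes a functional-API model that accepts a dict keyed by input layer name.

```python
import numpy as np


def fill_missing_inputs(inputs, input_shapes, batch_size=1):
    """Return a dict with every expected input, using zeros where absent.

    inputs       : {name: array} for the inputs the caller actually has
    input_shapes : {name: per-sample shape tuple} for all model inputs
    """
    filled = {}
    for name, shape in input_shapes.items():
        value = inputs.get(name)
        if value is None:
            # No data supplied for this input: substitute an all-zeros batch.
            value = np.zeros((batch_size,) + tuple(shape), dtype="float32")
        filled[name] = np.asarray(value)
    return filled


# Hypothetical input names and shapes, purely for illustration.
input_shapes = {"text": (50,), "metadata": (8,)}
batch = fill_missing_inputs({"text": np.ones((1, 50))}, input_shapes)
# model.predict(batch) would then receive an array for every input.
```

Whether zeros are a sensible stand-in depends on the model having seen zeroed inputs during training, which is roughly what input-level dropout provides.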

keras-team / keras

Incorporating Word Vectors rather than using an Embedding class #853
