Analogy is one of the characteristic properties of dense word embedding vectors. I evaluated analogies using cosine similarity on word embedding vectors learned from the Twenty Newsgroups dataset. I saved the word and topic vectors at the end of each epoch.
Here's my test code:
import numpy as np
from lda2vec import utils, model

# Word and topic vectors saved at the end of epoch 199
topic_embedding_199 = np.load("topic_weights_199.npy")
word_embedding_199 = np.load("word_weights_199.npy")

# Load the preprocessed Twenty Newsgroups data from files
data_path = "data/clean_data_twenty_newsgroups"
load_embeds = False
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)
def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v:
        cos(u, v) = dot(u, v) / (||u|| * ||v||)
    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)
    Returns:
        cosine_similarity -- the cosine similarity between u and v
    """
    # Dot product between u and v
    dot = np.dot(u, v)
    # L2 norms of u and v
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # Cosine similarity is the dot product scaled by both norms
    return dot / (norm_u * norm_v)
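As a quick sanity check on the helper (my addition, assuming 'king' made it into the vocabulary), a vector should score 1.0 against itself and -1.0 against its negation:

v = word_embedding_199[word_to_idx['king']]
print(cosine_similarity(v, v))    # ~ 1.0
print(cosine_similarity(v, -v))   # ~ -1.0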
def complete_analogy(word_a, word_b, word_c, embedding):
    """
    Performs the word analogy task: a is to b as c is to ____.
    Arguments:
        word_a -- a word, string
        word_b -- a word, string
        word_c -- a word, string
        embedding -- array of shape (vocab_size, n); row i is the vector
                     for vocabulary index i
    Returns:
        best_word -- the word such that v_b - v_a is closest to
                     v_best_word - v_c, as measured by cosine similarity
    """
    # Convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Look up the embedding vectors e_a, e_b and e_c
    # (word_to_idx maps a word to its row in the embedding matrix)
    e_a = embedding[word_to_idx[word_a]]
    e_b = embedding[word_to_idx[word_b]]
    e_c = embedding[word_to_idx[word_c]]

    words = word_to_idx.keys()
    max_cosine_sim = -100  # initialize to a large negative number
    best_word = None       # tracks the word to output

    # Loop over the whole vocabulary
    for w in words:
        # To avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        # Cosine similarity between (e_b - e_a) and (w's vector - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, embedding[word_to_idx[w]] - e_c)

        # Keep the word with the highest cosine similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word
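The Python loop above makes one NumPy call per vocabulary word; the same search can be done in a few array operations. Here is a sketch of the vectorized equivalent (my addition, assuming the same word_to_idx and idx_to_word mappings loaded above):

def complete_analogy_vectorized(word_a, word_b, word_c, embedding):
    # Same objective as complete_analogy, computed with array ops
    a, b, c = (word_to_idx[w.lower()] for w in (word_a, word_b, word_c))
    direction = embedding[b] - embedding[a]          # e_b - e_a
    diffs = embedding - embedding[c]                 # e_w - e_c for every word w
    sims = diffs @ direction / (
        np.linalg.norm(diffs, axis=1) * np.linalg.norm(direction) + 1e-12)
    sims[[a, b, c]] = -np.inf                        # exclude the input words
    return idx_to_word[int(sims.argmax())]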
triads_to_try = [('king', 'man', 'lady'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_embedding_199)))
king -> man :: lady -> x
man -> woman :: boy -> deny
small -> smaller :: large -> kind
This isn't what I expected ... as a sanity check, I will re-run the same triads with pretrained GloVe embeddings.
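For that follow-up, something like this minimal loader should do, assuming the standard space-separated glove.6B.50d.txt file from the Stanford GloVe release (the file name is my assumption). Note that complete_analogy reads the module-level word_to_idx, so the GloVe mappings would need to be swapped in for it:

# Hypothetical loader for pretrained GloVe vectors (file name assumed)
glove_word_to_idx, glove_idx_to_word, rows = {}, {}, []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        parts = line.rstrip().split(' ')   # word followed by its floats
        glove_word_to_idx[parts[0]] = i
        glove_idx_to_word[i] = parts[0]
        rows.append(np.asarray(parts[1:], dtype=np.float32))
glove_embedding = np.stack(rows)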
Nearest word vector to topic #9, i.e. the highest cosine similarity:

idx = np.array([cosine_similarity(x, topic_embedding_199[9]) for x in word_embedding_199]).argmax()
print(idx_to_word[idx])
sure
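The single nearest word is a noisy summary of a topic; here is a short sketch (my addition) listing the ten closest words to topic #9 instead:

sims = np.array([cosine_similarity(x, topic_embedding_199[9]) for x in word_embedding_199])
for idx in sims.argsort()[::-1][:10]:   # ten highest similarities
    print(idx_to_word[int(idx)], sims[idx])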
... and what was 'learned' between epoch 198 and epoch 199 for topic #9, i.e. the nearest word to the epoch-198 topic vector, for comparison:

topic_embedding_198 = np.load("topic_weights_198.npy")
idx = np.array([cosine_similarity(x, topic_embedding_198[9]) for x in word_embedding_199]).argmax()
print(idx_to_word[idx])
den
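A related check (my addition, not part of the original run): how far did the topic vector itself move between the two epochs? A value near 1.0 would mean topic #9 had essentially stopped changing:

print(cosine_similarity(topic_embedding_198[9], topic_embedding_199[9]))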