nateraw / Lda2vec-Tensorflow

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum
MIT License

Evaluating Learned Word Embeddings #59

Open dbl001 opened 5 years ago

dbl001 commented 5 years ago

Analogy completion is one of the expected properties of dense word embedding vectors. I evaluated analogies using cosine similarity on the word embedding vectors learned from the Twenty Newsgroups dataset, saving the word and topic vectors at the end of each epoch.
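For reference, the similarity measure used throughout is the standard cosine similarity, u · v / (||u|| ||v||), which in NumPy terms is just:

np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))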

Here's my test code:

import numpy as np
from lda2vec import utils, model

topic_embedding_199 = np.load("topic_weights_199.npy")
word_embedding_199 = np.load("word_weights_199.npy")

data_path = "data/clean_data_twenty_newsgroups"
load_embeds = False

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

def cosine_similarity(u, v):
    """
    Cosine similarity reflects the degree of similarity between u and v.

    Arguments:
        u -- a word vector of shape (n,)
        v -- a word vector of shape (n,)

    Returns:
        cosine_similarity -- the cosine similarity between u and v
    """
    # Dot product between u and v
    dot = np.dot(u, v)
    # L2 norms of u and v
    norm_u = np.linalg.norm(u)
    norm_v = np.linalg.norm(v)
    # Cosine similarity
    cosine_similarity = dot / (norm_u * norm_v)

    return cosine_similarity
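As a quick sanity check on toy vectors (my own addition, not taken from the trained model), the function behaves as expected:

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0  (same direction)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0  (orthogonal)
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0 (opposite direction)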
def complete_analogy(word_a, word_b, word_c, embedding_matrix):
    """
    Performs the word analogy task: a is to b as c is to ____.

    Arguments:
    word_a -- a word, string
    word_b -- a word, string
    word_c -- a word, string
    embedding_matrix -- array of word vectors; row i is the vector for the word with id i in word_to_idx

    Returns:
    best_word -- the word such that v_b - v_a is close to v_best_word - v_c, as measured by cosine similarity
    """

    # convert words to lower case
    word_a, word_b, word_c = word_a.lower(), word_b.lower(), word_c.lower()

    # Look up the embedding vectors for word_a, word_b and word_c
    e_a = embedding_matrix[word_to_idx[word_a]]
    e_b = embedding_matrix[word_to_idx[word_b]]
    e_c = embedding_matrix[word_to_idx[word_c]]

    words = word_to_idx.keys()
    max_cosine_sim = -100   # Initialize max_cosine_sim to a large negative number
    best_word = None        # Best candidate seen so far

    # loop over the whole vocabulary
    for w in words:
        # to avoid best_word being one of the input words, skip them
        if w in [word_a, word_b, word_c]:
            continue

        # Cosine similarity between (e_b - e_a) and (w's vector representation - e_c)
        cosine_sim = cosine_similarity(e_b - e_a, embedding_matrix[word_to_idx[w]] - e_c)

        # Keep the word with the highest cosine similarity seen so far
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w

    return best_word

triads_to_try = [('king', 'man', 'lady'), ('man', 'woman', 'boy'), ('small', 'smaller', 'large')]
for triad in triads_to_try:
    print('{} -> {} :: {} -> {}'.format(*triad, complete_analogy(*triad, word_embedding_199)))

king -> man :: lady -> x
man -> woman :: boy -> deny
small -> smaller :: large -> kind

This isn't what I expected ... I will test this using the GloVe embeddings.
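A minimal sketch of that GloVe comparison, reusing np and cosine_similarity from above; the glove.6B.50d.txt file name and the glove_analogy helper are my own placeholders, not part of this repo:

# Build a {word: vector} dict from a local pretrained GloVe file (placeholder path)
word_to_vec_map = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        word_to_vec_map[parts[0]] = np.asarray(parts[1:], dtype=np.float64)

def glove_analogy(word_a, word_b, word_c, word_to_vec_map):
    """Same analogy task as above, but over a {word: vector} dictionary."""
    e_a, e_b, e_c = (word_to_vec_map[w.lower()] for w in (word_a, word_b, word_c))
    best_word, max_cosine_sim = None, -100
    for w, e_w in word_to_vec_map.items():
        if w in (word_a, word_b, word_c):
            continue
        sim = cosine_similarity(e_b - e_a, e_w - e_c)
        if sim > max_cosine_sim:
            max_cosine_sim, best_word = sim, w
    return best_word

print(glove_analogy('man', 'woman', 'boy', word_to_vec_map))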

Nearest embedding vector to topic:

idx = np.array([cosine_similarity(x, topic_embedding_199[9]) for x in word_embedding_199]).argmin()
print(idx_to_word[idx])

sure
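A possibly more informative check (my own addition, under the same assumption that row i of word_embedding_199 corresponds to idx_to_word[i]) is to rank the whole vocabulary by similarity to the topic vector instead of looking at a single index:

# Ten highest-similarity words for topic #9 at epoch 199
sims = np.array([cosine_similarity(x, topic_embedding_199[9]) for x in word_embedding_199])
for i in sims.argsort()[::-1][:10]:
    print(idx_to_word[i], sims[i])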

... and what was 'learned' between epoch 198 and epoch 199 for topic #9

# topic vectors from the previous epoch; assuming they were saved under the same naming scheme
topic_embedding_198 = np.load("topic_weights_198.npy")
idx = np.array([cosine_similarity(x, -topic_embedding_198[9]) for x in word_embedding_199]).argmin()
print(idx_to_word[idx])

den
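For completeness, the per-topic drift between the two epochs can also be measured directly (again just a sketch using the arrays loaded above):

# cosine similarity between topic #9's vector at epoch 198 and at epoch 199
print(cosine_similarity(topic_embedding_198[9], topic_embedding_199[9]))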