stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Pre-trained text file contains missing or extra vectors for some words (keys) #191

Closed prabathbr closed 3 years ago

prabathbr commented 3 years ago

I downloaded the pre-trained "Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors)" data file from https://nlp.stanford.edu/data/glove.840B.300d.zip

While inspecting the file, which should contain a 300-dimensional vector for each word, I found that some words have missing or extra vector components. In particular, a couple of words have an email address as their first vector value.

I used this Python script for verification:

%%time
import numpy as np

glove_file_name = "glove.840B.300d.txt"

file = open(glove_file_name, encoding='utf-8')

for line in file:
    values = line.split()            # split on any whitespace
    word = values[0]                 # first field should be the word
    vector = np.asarray(values[1:])  # remaining fields should be the 300 components

    if len(vector) != 300:           # report lines that are not exactly 300d
        print(len(vector), "-->", word, "--->", vector[0])

file.close()

Output (one line per problematic word): <number of vector components> --> <word> ---> <first vector component>

302 --> . ---> .
301 --> at ---> name@domain.com
299 --> 0.20785 ---> 0.2703
304 --> . ---> .
301 --> to ---> name@domain.com
301 --> . ---> .
303 --> . ---> .
301 --> email ---> name@domain.com
299 --> 0.39511 ---> 0.37458
301 --> or ---> name@domain.com
299 --> 0.13211 ---> 0.19999
301 --> contact ---> name@domain.com
299 --> -0.38024 ---> 0.61431
299 --> -0.0033421 ---> 0.4899
301 --> Email ---> name@domain.com
301 --> on ---> name@domain.com
299 --> 0.14608 ---> 0.31513
299 --> -0.36288 ---> -0.075749
301 --> At ---> Killerseats.com
301 --> by ---> name@domain.com
301 --> in ---> mylot.com
299 --> 0.5478 ---> 0.18474
301 --> emailing ---> name@domain.com
301 --> Contact ---> name@domain.com
299 --> 0.59759 ---> -0.64012
301 --> at ---> name@domain.com
301 --> • ---> name@domain.com
301 --> at ---> Amazon.com
301 --> is ---> name@domain.com
Wall time: 2min 50s
AngledLuffa commented 3 years ago

In some of these examples with more than 301 pieces after split(), there's actually an NBSP in between two text words. The examples which apparently have fewer than 300 are even more exciting: the text is actually entirely composed of NBSP. Whether or not that is particularly useful, I don't know, but the point is, you should get exactly 300-dim vectors if you do split(" ") rather than just split(). At any rate, with 2.1M words in the file, you're probably not losing much if you throw out the weird outliers.
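
A minimal illustration of the difference (the sample lines and values below are made up, not taken from the actual file):

# A token containing a no-break space (U+00A0): split() breaks it apart,
# split(" ") keeps it intact as a single word.
line = "hello\xa0world " + " ".join(["0.1"] * 300)
print(len(line.split()) - 1)      # 301 -- NBSP treated as a delimiter
print(len(line.split(" ")) - 1)   # 300 -- only ASCII spaces delimit fields

# A "word" made entirely of NBSP: split() drops it, so the first number is
# mistaken for the word and only 299 components seem to remain.
line2 = "\xa0 " + " ".join(["0.1"] * 300)
print(len(line2.split()) - 1)     # 299
print(len(line2.split(" ")) - 1)  # 300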

prabathbr commented 3 years ago

Thank you very much. "values = line.split(" ")" solved the issue as you mentioned.
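
For reference, here is a minimal sketch of the verification loop with that fix applied (the with-statement, the rstrip of the trailing newline, and the float32 dtype are additions for illustration, not part of the original script):

import numpy as np

glove_file_name = "glove.840B.300d.txt"
embeddings = {}

with open(glove_file_name, encoding="utf-8") as file:
    for line in file:
        # Split only on the literal space delimiter, so tokens containing
        # other whitespace (e.g. NBSP) stay intact as the word.
        values = line.rstrip("\n").split(" ")
        word = values[0]
        if len(values) - 1 != 300:   # with split(" ") this should not trigger
            print(len(values) - 1, "-->", word)
            continue
        embeddings[word] = np.asarray(values[1:], dtype="float32")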