Closed: prabathbr closed this issue 3 years ago
In some of these examples with more than 301 pieces after split(), there's actually an NBSP between two text words. The examples that apparently have fewer than 300 are even more exciting: the text is actually composed entirely of NBSPs. Whether or not that is particularly useful, I don't know, but the point is that you get exactly 300-dimensional vectors if you do split(" ") rather than just split(). At any rate, with 2.2M words in the file, you're probably not losing much if you throw out the weird outliers.
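A minimal illustration of the difference (the token here is made up, not taken from the GloVe file): Python's str.split() with no argument treats NBSP (U+00A0) as whitespace, while split(" ") splits only on the ASCII space character.

```python
# Illustrative only: the token below is hypothetical, not from the GloVe file.
line = "New\u00a0York 0.1 0.2 0.3"  # "New York" joined by a non-breaking space

print(line.split())     # ['New', 'York', '0.1', '0.2', '0.3'] -- NBSP acts as a separator
print(line.split(" "))  # ['New\xa0York', '0.1', '0.2', '0.3'] -- NBSP kept inside the token
```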
Thank you very much. `values = line.split(" ")` solved the issue, as you mentioned.
I have downloaded the pre-trained "Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors)" data file from https://nlp.stanford.edu/data/glove.840B.300d.zip
While inspecting the file, which should contain a 300-dimensional vector for each word, I found that some of the words have missing or extra vector values. In particular, there are a couple of words that have an email address as the first vector weight.
I used this Python script for the verification:
Output key for each line:
<number of vectors for the word> --> <word> ---> <first vector of the word>
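The script itself was not preserved in this excerpt; the following is a minimal sketch of an equivalent check, assuming the standard GloVe text format (one word followed by its values per line) and a local file named glove.840B.300d.txt:

```python
# Minimal sketch of a verification script (assumed, not the original):
# flags every line that does not yield exactly 301 pieces (word + 300 values).
with open("glove.840B.300d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()  # the variant that miscounts lines containing NBSP
        if len(values) != 301:
            word = values[0] if values else "<empty>"
            first = values[1] if len(values) > 1 else "<none>"
            # <number of vectors for the word> --> <word> ---> <first vector of the word>
            print(f"{len(values) - 1} --> {word} ---> {first}")
```

Changing split() to split(" ") in this sketch makes every line come out to exactly 301 pieces, which is the fix applied above.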