stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.86k stars 1.51k forks

Dataset Error #17

Closed hongkunyoo closed 8 years ago

hongkunyoo commented 8 years ago

It seems there is an integrity error in the Common Crawl 840B-token file (http://nlp.stanford.edu/data/glove.840B.300d.zip) at line number 140649: the word on that line is broken. Could you check this out?

ghost commented 8 years ago

Huh, so I looked, and indeed the word on that line appears as ��������������� in the terminal. However, the line still parsed correctly as a space-separated Unicode token followed by floats. I would remove it, but when I checked for other such words, I found that many other word tokens also begin with non-alphanumeric characters. Of the nearly 2 million words in the vocab, a few thousand appear to begin with non-printable ASCII codes. It's not ideal that we included these in the txt files, but I don't think they'll do much harm if you parse them carefully. Is there something wrong with this word (other than its uselessness in the dataset) that I missed?
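
For anyone wondering what "parse them carefully" could look like in practice, here is a minimal Python sketch, not something taken from the GloVe tooling. It assumes each line is a token followed by a fixed number of space-separated floats (300 for this file, going by its name) and that tokens are meant to be UTF-8. The helper names `parse_glove_line`, `is_suspicious`, and `load_vectors` are made up for illustration. It reads bytes rather than text, takes the last `dim` fields as the vector so that odd tokens cannot shift the columns, and records the line numbers of tokens that do not decode cleanly or that contain non-printable characters.

```python
import io


def parse_glove_line(raw: bytes, dim: int):
    """Split one raw line into (token, vector), taking the vector from the right."""
    parts = raw.rstrip(b"\r\n").split(b" ")
    if len(parts) < dim + 1:
        raise ValueError(f"expected a token and {dim} floats, got {len(parts)} fields")
    # Everything before the last `dim` fields is the token, even if it contains spaces.
    token_bytes = b" ".join(parts[:-dim])
    vector = [float(field.decode("ascii")) for field in parts[-dim:]]
    # Undecodable bytes become U+FFFD instead of raising an exception.
    token = token_bytes.decode("utf-8", errors="replace")
    return token, vector


def is_suspicious(token: str) -> bool:
    """Flag empty tokens, tokens with decode damage, or non-printable characters."""
    if not token or "\ufffd" in token:
        return True
    return any(not ch.isprintable() for ch in token)


def load_vectors(stream, dim: int, skip_suspicious: bool = True):
    """Read vectors from a binary stream; return (vectors, skipped line numbers)."""
    vectors = {}
    skipped = []
    for line_no, raw in enumerate(stream, start=1):
        if not raw.strip():
            continue  # ignore blank lines
        token, vector = parse_glove_line(raw, dim)
        if skip_suspicious and is_suspicious(token):
            skipped.append(line_no)
            continue
        vectors[token] = vector
    return vectors, skipped


# Tiny in-memory stand-in for the real file; the second line has invalid UTF-8.
sample = b"hello 0.1 0.2 0.3\n\xff\xfe 1.0 2.0 3.0\nworld -1 0 1e-1\n"
vectors, skipped = load_vectors(io.BytesIO(sample), dim=3)
```

For the real file, the stream would be opened in binary mode (for example `open(path, "rb")`, with `path` pointing at the unzipped txt file) and `dim=300` passed in. Whether to drop the flagged tokens or keep them is a judgment call; passing `skip_suspicious=False` keeps every line.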