Closed hongkunyoo closed 8 years ago
Huh, so I looked, and indeed the word on that line appears as ��������������� in the terminal. However, it still parsed correctly as a space-separated Unicode token followed by floats. I would remove it, but when I checked for other such words, I found that the distribution of first characters across the word tokens includes many other non-alphanumeric characters. It appears that a few thousand of the nearly 2 million words in the vocab begin with non-printable ASCII codes. It's not ideal that we included these in the txt files, but I don't think they'll do much harm if you parse them carefully. Is there something wrong with this word (other than its uselessness in the dataset) that I missed?
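For anyone hitting the same thing, here is a minimal sketch of what "parsing carefully" could look like. It is not the official GloVe loader. The function names `parse_glove_line` and `load_glove` are hypothetical, and it assumes each line is a token followed by exactly `dim` space-separated floats (300 for this file). Splitting from the right keeps odd tokens intact, even ones that contain spaces or unprintable characters.

```python
def parse_glove_line(line, dim=300):
    """Split one line into (word, vector), taking the last `dim` fields as floats.

    Hypothetical helper: assumes the format "<token> <f1> ... <f_dim>".
    """
    # rsplit from the right so the token may itself contain spaces.
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected a token plus {dim} floats, got {len(parts)} fields")
    word = parts[0]
    vector = [float(x) for x in parts[1:]]  # raises ValueError on a non-numeric field
    return word, vector


def load_glove(path, dim=300):
    """Read a GloVe-style text file into a dict, collecting lines that fail to parse."""
    vectors = {}
    bad_lines = []
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising;
    # newline="\n" avoids splitting a line on a stray carriage return inside a token.
    with open(path, "r", encoding="utf-8", errors="replace", newline="\n") as f:
        for line_no, line in enumerate(f, start=1):
            try:
                word, vector = parse_glove_line(line, dim)
            except ValueError:
                bad_lines.append(line_no)
                continue
            vectors[word] = vector
    return vectors, bad_lines
```

With this approach the odd tokens stay in the vocabulary as-is, and `bad_lines` tells you which line numbers, if any, really are malformed rather than just ugly to look at.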
It seems there is an integrity error in the Common Crawl 840B-token file (http://nlp.stanford.edu/data/glove.840B.300d.zip) at line number 140649. The word on that line is broken. Could you check this out?