Dataset Error - Githubissues

Huh, so I looked, and the word on that line does indeed appear as �� in the terminal. However, the line still parsed correctly as a space-separated Unicode word followed by floats. I would remove it, but when I checked for other such words, I found that many other word tokens also begin with non-alphanumeric characters. Of the nearly 2 million words in the vocab, a few thousand appear to begin with non-printable ASCII codes. It's not ideal that we included these in the txt files, but I don't think they'll do much harm as long as you parse them carefully. Is there something wrong with this word (other than its uselessness in the dataset) that I missed?
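For anyone wondering what "parse them carefully" could look like, here is a rough Python sketch, not the loader used in this repo. It assumes each line is a word followed by a fixed number of space-separated floats, and the helper names (`parse_line`, `is_suspicious`, `load_vectors`) as well as the rule for what counts as a "suspicious" word are my own illustrative choices.

```python
import unicodedata


def parse_line(line, dim):
    """Split one vector line into (word, floats).

    Splitting from the right keeps the word intact even if it contains
    unusual characters; only the last `dim` fields are treated as floats.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    if len(parts) != dim + 1:
        raise ValueError(f"expected {dim} floats, got {len(parts) - 1}")
    return parts[0], [float(x) for x in parts[1:]]


def is_suspicious(word):
    """Assumed heuristic: empty, or starts with U+FFFD or a control/other character."""
    if not word:
        return True
    first = word[0]
    # U+FFFD is the replacement character (what a terminal shows as �).
    return first == "\ufffd" or unicodedata.category(first).startswith("C")


def load_vectors(lines, dim):
    """Return (kept, skipped): a word -> vector dict and the list of skipped words."""
    kept, skipped = {}, []
    for line in lines:
        word, values = parse_line(line, dim)
        if is_suspicious(word):
            skipped.append(word)
        else:
            kept[word] = values
    return kept, skipped


if __name__ == "__main__":
    sample = ["the 0.1 0.2 0.3\n", "\ufffd\ufffd 1.0 2.0 3.0\n"]
    kept, skipped = load_vectors(sample, dim=3)
    print(len(kept), "kept;", len(skipped), "skipped")
```

When reading from a real file, opening it with `open(path, encoding="utf-8", errors="replace", newline="\n")` should help: undecodable bytes become U+FFFD instead of raising, and lines are split only on `\n`, so a stray carriage return inside a word does not break a line in two. Whether to drop the flagged words or keep them is a judgment call; the sketch only separates them out.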

stanfordnlp / GloVe

Dataset Error #17