stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

tokenization not separating punctuation from words #55

Closed mattmacy closed 7 years ago

mattmacy commented 7 years ago

I'm looking at the vocab.txt generated by running over the latest itwiki dump, with XML tags stripped by a WikiExtractor modified not to insert doc tags. It doesn't seem to separate punctuation from words correctly: I'm getting 10 different variations of the word "zia" (aunt). I must be doing something wrong; was I expected to pre-process the corpus?

```
:~/Downloads/GloVe-1.2$ grep ^zia vocab.txt
zia 4021
zia, 683
zia. 274
zia" 30
zia: 13
zia; 12
zia) 10
zia", 8
zia), 5
zia). 5
```

ghost commented 7 years ago

Yes, this behavior is detailed at https://github.com/stanfordnlp/GloVe/blob/master/src/README.md; I have updated the wording a little to make it clearer. We recommend running the Stanford Tokenizer on the corpus first, although there are plenty of other options.
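
For reference, here is a minimal sketch (not part of the GloVe distribution) of how one might pre-tokenize a corpus with the Stanford Tokenizer (the PTBTokenizer class from Stanford CoreNLP) before running vocab_count. It assumes the CoreNLP jar is on the classpath; the file names corpus.txt and corpus.tok.txt are placeholders. PTBTokenizer is English-oriented, but the relevant behavior here is that it splits punctuation off words.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.StringReader;

public class PreTokenize {
    public static void main(String[] args) throws Exception {
        try (BufferedReader in = new BufferedReader(new FileReader("corpus.txt"));
             PrintWriter out = new PrintWriter("corpus.tok.txt")) {
            String line;
            while ((line = in.readLine()) != null) {
                // Tokenize one line at a time so the original line/document
                // boundaries survive; PTBTokenizer splits punctuation off words,
                // e.g. "zia," becomes the two tokens "zia" and ",".
                PTBTokenizer<CoreLabel> tok =
                    new PTBTokenizer<>(new StringReader(line),
                                       new CoreLabelTokenFactory(), "");
                StringBuilder sb = new StringBuilder();
                while (tok.hasNext()) {
                    if (sb.length() > 0) sb.append(' ');
                    sb.append(tok.next().word());
                }
                out.println(sb);
            }
        }
    }
}
```

For a full Wikipedia dump it may be more practical to use the tokenizer's command-line entry point in the CoreNLP jar (something like `java -cp stanford-corenlp-*.jar edu.stanford.nlp.process.PTBTokenizer -preserveLines corpus.txt > corpus.tok.txt`) rather than the API sketch above.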

mattmacy commented 7 years ago

Thanks