mattmacy closed this issue 7 years ago
Yes, this behavior is detailed at https://github.com/stanfordnlp/GloVe/blob/master/src/README.md. I have updated the wording a little to make it clearer. We recommend running the Stanford Tokenizer first, although there are plenty of other options.
Thanks
I'm looking at the vocab.txt generated from running over the latest itwiki dump, with XML tags stripped by a WikiExtractor modified not to insert doc tags. It doesn't seem to correctly separate out punctuation. I'm getting 10 different variations of the word "zia" (aunt). I must be doing something wrong; was I expected to pre-process the corpus?
```
:~/Downloads/GloVe-1.2$ grep ^zia vocab.txt
zia 4021
zia, 683
zia. 274
zia" 30
zia: 13
zia; 12
zia) 10
zia", 8
zia), 5
zia). 5
```
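As a minimal illustration of the recommended pre-processing step (this is a hypothetical regex-based sketch, not the Stanford Tokenizer itself): GloVe's `vocab_count` treats any whitespace-delimited string as a token, so punctuation must be split off words before building the vocabulary.

```python
import re

def simple_tokenize(text):
    # Split off punctuation so "zia," and "zia" count as the same word.
    # \w+ matches word characters; [^\w\s] matches any single
    # punctuation character. Unicode-aware, so accented Italian
    # characters stay inside words.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

line = 'la zia, disse "zia".'
print(" ".join(simple_tokenize(line)))
```

Running the corpus through a tokenizer like this (or the Stanford Tokenizer) before `vocab_count` should collapse the ten "zia" variants above into a single vocabulary entry plus separate punctuation tokens.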