Consider using the Stanford Tokenizer for GloVe. In the GloVe paper the authors say "We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words" and "For the model trained on Common Crawl data, we use a larger vocabulary of about 2 million words".
Matching their tokenization should give us better GloVe coverage: tokens that are split the same way the vocabulary was built are more likely to have an embedding, so fewer words fall out as OOV. A rough coverage check is sketched below.
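A minimal sketch of the idea. Since the Stanford PTB tokenizer itself is Java, this uses NLTK's TreebankWordTokenizer as a stand-in approximation, lowercases the tokens (matching the paper's preprocessing), and counts how many land in the GloVe vocabulary. The GloVe file path and the example sentence are just placeholders.

```python
# Rough coverage check: tokenize like GloVe's preprocessing (approximately),
# then see how many tokens actually have a pretrained vector.
from nltk.tokenize import TreebankWordTokenizer  # PTB-style rules, similar to the Stanford tokenizer

def load_glove_vocab(path):
    """Read just the words (first column) of a GloVe .txt file."""
    vocab = set()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.add(line.split(" ", 1)[0])
    return vocab

tokenizer = TreebankWordTokenizer()
glove_vocab = load_glove_vocab("glove.6B.300d.txt")  # placeholder path to a lowercased GloVe file

text = "Don't forget to lowercase, e.g. to match GloVe's preprocessing."  # placeholder example
tokens = [t.lower() for t in tokenizer.tokenize(text)]

covered = [t for t in tokens if t in glove_vocab]
print(f"{len(covered)}/{len(tokens)} tokens have GloVe vectors")
```

If the coverage number goes up compared to our current tokenizer, that's the win this suggestion is after.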