hassonlab / 247-pickling

Contains code to create pickles from raw/processed data

glove tokenizer #157

Open · zkokaja opened this issue 1 year ago

zkokaja commented 1 year ago

Consider using the Stanford Tokenizer for GloVe. In their paper they say: "We tokenize and lowercase each corpus with the Stanford tokenizer, build a vocabulary of the 400,000 most frequent words" and "For the model trained on Common Crawl data, we use a larger vocabulary of about 2 million words".

This could give us more GloVe embeddings, since tokenizing our text the same way the GloVe vocabulary was built should match more words. See the sketch below.
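For intuition, a minimal sketch of the coverage argument (the GloVe filename and lookup logic are illustrative assumptions, not our pipeline): PTB-style pieces like `do` + `n't` are more likely to hit the released GloVe vocabularies than a raw, untokenized `don't`.

```python
# Minimal sketch, assuming a local copy of "glove.6B.300d.txt".
def load_glove(path):
    """Load GloVe text-format vectors into a {word: [float, ...]} dict."""
    emb = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *vec = line.rstrip().split(" ")
            emb[word] = [float(x) for x in vec]
    return emb

emb = load_glove("glove.6B.300d.txt")
# PTB-style tokens are more likely to be in the vocabulary than raw strings.
for tok in ["don't", "do", "n't"]:
    print(tok, tok.lower() in emb)
```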

baubrey commented 1 year ago

https://www.nltk.org/_modules/nltk/tokenize/stanford.html
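For reference, a minimal sketch of that wrapper (the jar path is a hypothetical placeholder; NLTK shells out to Java, so a jar containing `edu.stanford.nlp.process.PTBTokenizer` has to be available locally):

```python
from nltk.tokenize.stanford import StanfordTokenizer

# Hypothetical jar location; see the download link in the next comment.
tok = StanfordTokenizer(path_to_jar="/path/to/stanford-postagger.jar")

# Tokenize, then lowercase to match how the GloVe corpora were processed.
tokens = tok.tokenize("We tokenize and lowercase each corpus.")
print([t.lower() for t in tokens])
```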

zkokaja commented 1 year ago

Seems like the Java jar is required; it can be downloaded from https://nlp.stanford.edu/software/lex-parser.html#Download
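A sketch of wiring that up, assuming the parser zip has been unpacked locally (the path is hypothetical). Note that the NLTK wrapper looks for `stanford-postagger.jar` by default, so the parser jar from that download probably needs to be passed explicitly:

```python
from nltk.tokenize.stanford import StanfordTokenizer

# Hypothetical unzip location of the lex-parser download; the parser jar
# also ships edu.stanford.nlp.process.PTBTokenizer, which the wrapper runs.
tok = StanfordTokenizer(path_to_jar="/path/to/stanford-parser.jar")
print(tok.tokenize("Mr. O'Neill doesn't like it."))
```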