stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Corpus token occurrence counts #115

Open rahulsmehta opened 6 years ago

rahulsmehta commented 6 years ago

Hi,

After checking the GloVe home page (https://nlp.stanford.edu/projects/glove/), and specifically the corpus used for the 6B version (Wikipedia 2014 + Gigaword 5), I was wondering whether there is a table or summary of the number of times each token occurs in that corpus.

Thanks!

npeirson commented 6 years ago

Hi! Just popping in to say that I don't know the answer to your question, but if no such summary exists yet, it would be a wonderful contribution to the research community. It could probably be produced with NLTK or a similar toolkit, and run in the cloud to cope with the size of the dataset. I'd be happy to volunteer some time on it as well, if it hasn't been done already.
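
For anyone who wants to try this, here is a minimal sketch of streaming token counting in plain Python. It is not the GloVe authors' pipeline: the whitespace tokenization, the `lowercase` option, and the function names `count_tokens` and `write_counts` are all assumptions made for illustration, so the resulting counts will only match the official vocabulary if the input text was pre-processed the same way.

```python
import io
from collections import Counter
from typing import Iterable, TextIO


def count_tokens(lines: Iterable[str], lowercase: bool = True) -> Counter:
    """Count whitespace-separated tokens, reading one line at a time.

    Assumption: the corpus is already tokenized so that splitting on
    whitespace yields the intended tokens. Lowercasing is optional and is
    also an assumption about the pre-processing.
    """
    counts: Counter = Counter()
    for line in lines:
        if lowercase:
            line = line.lower()
        counts.update(line.split())
    return counts


def write_counts(counts: Counter, out: TextIO) -> None:
    """Write one 'token count' pair per line, most frequent first.

    Ties are broken alphabetically so the output is deterministic.
    """
    for token, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        out.write(f"{token} {n}\n")


if __name__ == "__main__":
    # Tiny in-memory stand-in for a corpus file.
    sample = io.StringIO("The cat sat\nthe mat\n")
    buffer = io.StringIO()
    write_counts(count_tokens(sample), buffer)
    print(buffer.getvalue(), end="")
```

Because `count_tokens` accepts any iterable of lines, an open file handle can be passed directly and the corpus never has to fit in memory. Only the counter itself grows, in proportion to the vocabulary size.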

alvations commented 5 years ago

I'm also looking for the token count information for GloVe.

Does anyone know the exact source of the pre-processed text, especially for the Common Crawl versions? Only with identically pre-processed text can we reproduce the same tokens and their respective corpus counts.
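
One rough way to check whether a given pre-processing is close to the original is to compare the tokens it produces against the vocabulary of the released vectors. The sketch below assumes the released vector files are plain text with one entry per line, the token first and the vector components after it, separated by spaces. The function names `read_glove_vocab` and `vocab_overlap` are hypothetical, and a high overlap only suggests a similar tokenization; it does not confirm matching counts.

```python
from typing import Iterable, Set


def read_glove_vocab(lines: Iterable[str]) -> Set[str]:
    """Collect the first space-separated field of each line as a token.

    Assumption: each line looks like 'token v1 v2 ... vN'.
    Blank lines and lines with no space are skipped.
    """
    vocab: Set[str] = set()
    for line in lines:
        token, sep, _ = line.rstrip("\n").partition(" ")
        if sep and token:
            vocab.add(token)
    return vocab


def vocab_overlap(counted_tokens: Iterable[str], glove_vocab: Set[str]) -> float:
    """Return the fraction of the GloVe vocabulary found among counted tokens."""
    if not glove_vocab:
        return 0.0
    counted = set(counted_tokens)
    return len(counted & glove_vocab) / len(glove_vocab)


if __name__ == "__main__":
    # Tiny in-memory stand-in for a vectors file.
    vectors = ["the 0.1 0.2\n", "cat 0.3 0.4\n", "\n"]
    vocab = read_glove_vocab(vectors)
    print(vocab_overlap(["the", "dog"], vocab))
```

The overlap is measured against the GloVe vocabulary rather than the recounted one, because the released vocabularies are cut off at some minimum frequency and a recount will usually contain many rare tokens that were never included.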