Add GloVe pretrained models from CommonCrawl corpus

havingfun commented 4 years ago

Hi Team,

I see that two of Stanford's pretrained GloVe models, listed at https://nlp.stanford.edu/projects/glove/, are not available here yet. The ones that could be added are:

Common Crawl (42B tokens, 1.9M vocab, uncased, 300d vectors, 1.75 GB download): glove.42B.300d.zip
Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip
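
For a rough sense of scale, here is a back-of-the-envelope estimate of the in-memory size of each embedding matrix. It assumes the vectors are held as 32-bit floats (an assumption, not something stated on the GloVe page) and ignores the vocabulary strings themselves; the helper name `embedding_matrix_bytes` is made up for illustration.

```python
def embedding_matrix_bytes(vocab_size, dims, bytes_per_value=4):
    """Size of a dense vocab_size x dims matrix; 4 bytes per value assumes float32."""
    return vocab_size * dims * bytes_per_value


# Vocabulary sizes and dimensions as quoted in the list above.
models = [("glove.42B.300d", 1_900_000, 300), ("glove.840B.300d", 2_200_000, 300)]

for name, vocab_size, dims in models:
    gigabytes = embedding_matrix_bytes(vocab_size, dims) / 1e9
    print(f"{name}: ~{gigabytes:.2f} GB")
```

So under that assumption either model needs a couple of gigabytes of RAM once loaded, before counting the vocabulary itself.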

Thanks, Rajesh

kevinmneal commented 2 years ago

Resurrecting this. These models have enormous vocabs that could prove useful for more esoteric problems; I would love to be able to use them easily.

piskvorky commented 2 years ago

Sure, why not. I'm +1 on including those.

Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need:

a) Text that motivates adding each model (should be easy), including links to its original research, its preprocessing options, its license, etc. Basically a quick summary of "What is this?" and "Who is it for?"
b) Code that loads these models (to include in __init__.py; see e.g. fasttext-wiki-news-subwords-300). Again, should be easy; IIRC we already support the GloVe data format.
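
To illustrate the loading side of (b): a GloVe text file is one token per line followed by its space-separated vector values, which is the word2vec text format minus the leading "<vocab_size> <dims>" header line. Below is a minimal, standard-library-only sketch that adds that header. The function name `glove_to_word2vec` and the file paths are hypothetical, and this is not the code that would go into __init__.py; it also assumes no token contains a space.

```python
def glove_to_word2vec(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the '<vocab_size> <dims>' header line.

    Assumes every line is 'token v1 v2 ... vN' with single-space separators
    and that tokens themselves contain no spaces.
    """
    vocab_size = 0
    dims = None
    # First pass: count the vectors and infer dimensionality from the first line.
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if dims is None:
                dims = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1
    if dims is None:
        raise ValueError("empty GloVe file")

    # Second pass: write the header, then copy the vectors through unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            open(word2vec_path, "w", encoding="utf-8") as dst:
        dst.write(f"{vocab_size} {dims}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dims
```

The converted file should then be readable with gensim's `KeyedVectors.load_word2vec_format(path, binary=False)`. As far as I know, gensim 4.x can also skip the conversion by passing `no_header=True` on the original GloVe file.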

Cheers!

piskvorky / gensim-data

Add GloVe pretrained models from CommonCrawl corpus #40