piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

Add GloVe pretrained models from CommonCrawl corpus #40

Open havingfun opened 4 years ago

havingfun commented 4 years ago

Hi Team,

I see that two of the pretrained GloVe models published by Stanford at https://nlp.stanford.edu/projects/glove/ are not available in gensim-data yet. The ones that can be added are -

Thanks, Rajesh

kevinmneal commented 2 years ago

Resurrecting this. These models have enormous vocabularies that could prove useful for more esoteric problems. I would love to be able to use them easily.

piskvorky commented 2 years ago

Sure, why not. I'm +1 on including those.

Please check https://github.com/RaRe-Technologies/gensim-data#want-to-add-a-new-corpus-or-model; we'll need:

a) Text that motivates adding each model (should be easy), including any links to its original research, its preprocessing options, its license, etc. Basically a quick summary of "What is this?" and "Who is it for?"

b) Code that loads these models (to include in __init__.py; see e.g. fasttext-wiki-news-subwords-300). Again, this should be easy; IIRC we already support the GloVe data format.
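For (b), a rough sketch of what such an __init__.py could look like is below. It is only an illustration under stated assumptions: the release name `glove-common-crawl-example`, the `~/gensim-data/<name>/<name>.gz` layout, and the helper `glove_to_word2vec` are hypothetical and not taken from the repository. The helper assumes the raw GloVe text file has no header line, and prepends the `vocab_size dimensions` header that the word2vec text format expects, so that `KeyedVectors.load_word2vec_format` can read the result.

```python
import gzip
import os

# Hypothetical release name and storage layout, for illustration only.
MODEL_NAME = "glove-common-crawl-example"
BASE_DIR = os.path.join(os.path.expanduser("~"), "gensim-data")


def glove_to_word2vec(glove_path, out_path):
    """Write a gzipped word2vec-format copy of a raw GloVe text file.

    Assumes each input line is "token v1 v2 ... vN" separated by single
    spaces and that the file has no header line.
    Returns (vocab_size, dims).
    """
    # First pass: count the vectors and read the dimensionality
    # from the first line.
    vocab_size = 0
    dims = 0
    with open(glove_path, encoding="utf-8") as src:
        for line in src:
            if vocab_size == 0:
                dims = len(line.rstrip("\n").split(" ")) - 1
            vocab_size += 1

    # Second pass: write the header, then copy the vectors unchanged.
    with open(glove_path, encoding="utf-8") as src, \
            gzip.open(out_path, "wt", encoding="utf-8", newline="\n") as dst:
        dst.write(f"{vocab_size} {dims}\n")
        for line in src:
            dst.write(line)
    return vocab_size, dims


def load_data():
    """Load the converted vectors (assumed to be already downloaded)."""
    # Imported lazily so the converter above works without gensim installed.
    from gensim.models import KeyedVectors

    path = os.path.join(BASE_DIR, MODEL_NAME, MODEL_NAME + ".gz")
    return KeyedVectors.load_word2vec_format(path)
```

The conversion would be run once when preparing the release file; `load_data()` is the only part that would need to live in __init__.py.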

Cheers!