piskvorky / gensim-data

Data repository for pretrained NLP models and NLP corpora.
https://rare-technologies.com/new-api-for-pretrained-nlp-models-and-datasets-in-gensim/
GNU Lesser General Public License v2.1

New Corpus - Corpus of Historical American English (COHA) #49

Open ResearchLabDev opened 2 years ago

ResearchLabDev commented 2 years ago

Hello,

I was wondering whether it would be possible to add the pre-trained word vectors for the Corpus of Historical American English (COHA) from Stanford's HistWords project, available here: https://nlp.stanford.edu/projects/histwords/

In particular:

Genre-Balanced American English (1830s-2000s, 475 million words, 300d vectors, ~2.34 GB)

I think these would be extremely valuable to researchers studying how word meanings change over time.

Thank you for the consideration!