stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Source data for training embedding glove.840B.300d #190

Closed lfoppiano closed 3 years ago

lfoppiano commented 3 years ago

Dear all, I'm collecting information about various embedding approaches, and I'm looking for details on how you trained these embeddings: `Common Crawl (840B tokens, 2.2M vocab, cased, 300d vectors, 2.03 GB download): glove.840B.300d.zip`

The paper does not discuss these details.

In particular, I'm interested in:

Thank you in advance

AngledLuffa commented 3 years ago

The intent was English only, although other things may have snuck in. I don't think we have records on which version of Common Crawl, unfortunately.

lfoppiano commented 3 years ago

Thank you

lfoppiano commented 3 years ago

I have another question: do you have, by any chance, the command parameters that were used to train these embeddings?

AngledLuffa commented 3 years ago

I'm sorry, but the person who did the original training is long gone and didn't leave behind any notes.

lfoppiano commented 3 years ago

Thanks