stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

Train glove on oscar dataset #186

Closed: rzsgrt closed this issue 3 years ago

rzsgrt commented 3 years ago

Hi, is there a guide for training GloVe from multiple source files (e.g. from the OSCAR dataset)?

AngledLuffa commented 3 years ago

I don't know any of the specifics, but is it not possible to simply merge the files?

rzsgrt commented 3 years ago

Hi, I can't merge all the files. Also, I think we can't train GloVe using a generator, right? We need to calculate co-occurrences over the whole corpus.

AngledLuffa commented 3 years ago

The fundamental problem is there's only one input file in the glove code, AFAIK.
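
One possible workaround for the single-input limitation, without merging files on disk, is to concatenate them into one stream. The Python sketch below is only an illustration: `stream_corpus` and `write_stream` are hypothetical helper names, not part of the GloVe code, and it assumes the corpus is plain UTF-8 text split across several files.

```python
from typing import IO, Iterable, Iterator


def stream_corpus(paths: Iterable[str]) -> Iterator[str]:
    """Yield lines from each file in turn, reading one line at a time."""
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                # Make sure every line ends with a newline, so the last token
                # of one file does not fuse with the first token of the next.
                yield line if line.endswith("\n") else line + "\n"


def write_stream(paths: Iterable[str], out: IO[str]) -> int:
    """Write all files to `out` as one continuous corpus; return the line count."""
    written = 0
    for line in stream_corpus(paths):
        out.write(line)
        written += 1
    return written
```

If the GloVe tools read the corpus from standard input, as the repository's `demo.sh` appears to do with `< $CORPUS`, the output of `write_stream(paths, sys.stdout)` could be piped into them. The corpus would then need to be streamed twice, once for the vocabulary pass and once for the co-occurrence pass. Treat this as an assumption to verify against your own checkout.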

rzsgrt commented 3 years ago

I see. It's because we need the co-occurrence matrix from the whole corpus.
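
For what it's worth, co-occurrence counts are sums over the corpus, so in principle they can be accumulated piece by piece. The Python sketch below illustrates that idea only; it is not the GloVe C implementation. It assumes whitespace tokenization, a symmetric window, 1/distance weighting, and context windows that do not cross line boundaries. `cooccurrence_counts` is a hypothetical name.

```python
from collections import Counter
from typing import Iterable


def cooccurrence_counts(lines: Iterable[str], window: int = 2) -> Counter:
    """Accumulate distance-weighted, symmetric co-occurrence counts line by line."""
    counts: Counter = Counter()
    for line in lines:
        tokens = line.split()
        for i, word in enumerate(tokens):
            # Look back at up to `window` previous tokens on the same line.
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)  # closer pairs contribute more
                counts[(word, tokens[j])] += weight
                counts[(tokens[j], word)] += weight
    return counts


# Counting two parts separately and adding the results matches counting
# both parts in a single pass.
part1 = ["a b c"]
part2 = ["a b"]
whole = cooccurrence_counts(part1 + part2)
merged = cooccurrence_counts(part1) + cooccurrence_counts(part2)
```

Because `lines` can be any iterable, a generator over many files works here. What still has to fit somewhere is the accumulated count table and a vocabulary fixed in advance, not the raw corpus itself.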