stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0

How to speed up for large dataset #214

Open · linWujl opened this issue 1 year ago

linWujl commented 1 year ago

Hello, my corpus is 700 GB. Is there any way to speed things up?

AngledLuffa commented 1 year ago

More threads? Better hardware?
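
For concreteness, these are the speed-relevant knobs among the flags that demo.sh passes to each tool. The file names and values below are illustrative placeholders, not from this thread:

```bash
# Illustrative pipeline based on the flags used in demo.sh.
# corpus.txt and the output names are placeholders.
build/vocab_count -min-count 5 -verbose 2 < corpus.txt > vocab.txt

# -memory is an approximate RAM budget in GB; raising it lets cooccur
# keep more of the matrix in RAM and spill fewer overflow files.
build/cooccur -memory 64.0 -vocab-file vocab.txt -window-size 15 \
    -verbose 2 < corpus.txt > cooccurrence.bin
build/shuffle -memory 64.0 -verbose 2 \
    < cooccurrence.bin > cooccurrence.shuf.bin

# -threads is where extra cores pay off in the training step.
build/glove -threads 32 -input-file cooccurrence.shuf.bin \
    -vocab-file vocab.txt -save-file vectors -iter 15 \
    -vector-size 300 -binary 2 -x-max 100 -verbose 2
```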

linWujl commented 1 year ago

The cooccur step has taken about 7500 minutes so far, and it is still at the merge stage.

Is it possible to use Spark to construct the cooccurrence statistics and then train with TensorFlow?
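
For context, the merge stage is where cooccur combines the temporary overflow files it spilled to disk while counting; the fewer and larger those files, the cheaper the merge. A sketch of re-running the step with a bigger RAM budget and the overflow files on fast local disk (the flags are real cooccur options; the paths are made up):

```bash
# Re-run the cooccurrence step with a larger in-RAM budget so fewer
# overflow chunks are written, shrinking the final merge.
# /fast_scratch is a placeholder for a fast local disk.
build/cooccur -memory 128.0 -vocab-file vocab.txt -window-size 15 \
    -overflow-file /fast_scratch/overflow -verbose 2 \
    < corpus.txt > cooccurrence.bin
```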

AngledLuffa commented 1 year ago

We did try converting it to torch at one point, but it wound up being significantly slower than the C version. We may try again sometime. You are welcome to try...

Do you have enough memory? It might be worth checking top.
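
A couple of quick checks, assuming a Linux machine: if resident memory is near physical RAM or swap is in use, the merge will crawl.

```bash
# Snapshot of memory and swap usage.
free -h
# One batch snapshot from top; check the RES column for the cooccur process.
top -b -n 1 | head -n 20
```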