stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.81k stars · 1.51k forks

Attempting to train on own corpus #183

Closed: garrett-yoon closed this issue 3 years ago

garrett-yoon commented 3 years ago

Hi, I'm unsure why the loss is trending toward infinity when training on my small corpus. The vector.txt output is filled with 'nan'. I adjusted 'eta' (the learning rate) and am still having problems.
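(Not part of the original report, but for readers hitting the same NaN cost: a common first remedy on a very small corpus is to shrink the vector size and lower the learning rate. This is a sketch only; the file paths are placeholders, and the flags shown (`-vector-size`, `-eta`, `-iter`) follow the options printed by `build/glove -h`.)

```shell
# Hypothetical re-run with a smaller model and a more conservative
# learning rate; a 1284-word vocabulary rarely needs 500 dimensions.
build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -save-file vectors -vector-size 50 -eta 0.01 -iter 15 -threads 4
```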

gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result

BUILDING VOCABULARY
Processed 146823 tokens.
Counted 2957 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1284.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 1284 words.
Building lookup table...table contains 1648657 elements.
Processed 146823 tokens.
Writing cooccurrences to disk......2 files in total.
Merging cooccurrence files: processed 325221 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 325221 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 325221 lines.

TRAINING MODEL
Read 325221 lines.
Initializing parameters...done.
vector size: 500
vocab size: 1284
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: nan
iter: 002, cost: nan
iter: 003, cost: nan
iter: 004, cost: nan
iter: 005, cost: nan
iter: 006, cost: nan
iter: 007, cost: nan
iter: 008, cost: nan
iter: 009, cost: nan
iter: 010, cost: nan
iter: 011, cost: nan
iter: 012, cost: nan
iter: 013, cost: nan
iter: 014, cost: nan
iter: 015, cost: nan

AngledLuffa commented 3 years ago

Is it possible to share the data files with us?

AngledLuffa commented 3 years ago

Thanks. I can probably work on this next week or the week after. Was there a reason you closed the bug?


garrett-yoon commented 3 years ago

Hi there,

No worries about it. I think I figured out the issue.

AngledLuffa commented 3 years ago

What is it?
