stanfordnlp / GloVe

Software in C and data files for the popular GloVe model for distributed word representations, a.k.a. word vectors or embeddings
Apache License 2.0
6.81k stars · 1.51k forks

Attempting to train on own corpus #183

Closed: garrett-yoon closed this issue 3 years ago

garrett-yoon commented 3 years ago

Hi, I'm unsure why the loss is trending toward infinity when training on my small corpus. The vector.txt output is filled with 'nan'. I adjusted 'eta' (the learning rate) and am still having problems.
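(Not part of the original report, but for readers hitting the same NaN cost: a common first remedy on a very small corpus is to shrink the vector size and lower the learning rate. This is a sketch only; the file paths are placeholders, and the flags shown (`-vector-size`, `-eta`, `-iter`) follow the options printed by `build/glove -h`.)

```shell
# Hypothetical re-run with a smaller model and a more conservative
# learning rate; a 1284-word vocabulary rarely needs 500 dimensions.
build/glove -input-file cooccurrence.shuf.bin -vocab-file vocab.txt \
    -save-file vectors -vector-size 50 -eta 0.01 -iter 15 -threads 4
```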

gcc src/glove.c -o build/glove -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/shuffle.c -o build/shuffle -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/cooccur.c -o build/cooccur -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result
gcc src/vocab_count.c -o build/vocab_count -lm -pthread -Ofast -march=native -funroll-loops -Wno-unused-result

BUILDING VOCABULARY
Processed 146823 tokens.
Counted 2957 unique words.
Truncating vocabulary at min count 5.
Using vocabulary of size 1284.

COUNTING COOCCURRENCES
window size: 15
context: symmetric
max product: 13752509
overflow length: 38028356
Reading vocab from file "vocab.txt"...loaded 1284 words.
Building lookup table...table contains 1648657 elements.
Processed 146823 tokens.
Writing cooccurrences to disk......2 files in total.
Merging cooccurrence files: processed 325221 lines.

SHUFFLING COOCCURRENCES
array size: 255013683
Shuffling by chunks: processed 325221 lines.
Wrote 1 temporary file(s).
Merging temp files: processed 325221 lines.

TRAINING MODEL
Read 325221 lines.
Initializing parameters...done.
vector size: 500
vocab size: 1284
x_max: 10.000000
alpha: 0.750000
iter: 001, cost: nan
iter: 002, cost: nan
iter: 003, cost: nan
iter: 004, cost: nan
iter: 005, cost: nan
iter: 006, cost: nan
iter: 007, cost: nan
iter: 008, cost: nan
iter: 009, cost: nan
iter: 010, cost: nan
iter: 011, cost: nan
iter: 012, cost: nan
iter: 013, cost: nan
iter: 014, cost: nan
iter: 015, cost: nan

AngledLuffa commented 3 years ago

Is it possible to share the data files with us?

AngledLuffa commented 3 years ago

Thanks. I can probably work on this next week or the week after. Was there a reason you closed the bug?


garrett-yoon commented 3 years ago

Hi there,

No worries about it. I think I figured out the issue.

AngledLuffa commented 3 years ago

What is it?
