uhh-lt / sensegram

Making sense embedding out of word embeddings using graph-based word sense induction
http://uhh-lt.github.io/sensegram

Receiving error when trying to convert word embeddings to sense embeddings #26

Closed samarohith closed 5 years ago

samarohith commented 5 years ago

I ran the code to convert word embeddings to sense embeddings. When I tried using the files "wiki.txt.word_vectors" and "ukwac.txt.word_vectors", I received the error "you must first build vocabulary before training the model." Below is the traceback (for ukwac.txt.word_vectors).

Traceback (most recent call last):
  File "/content/sensegram/train.py", line 114, in <module>
    main()
  File "/content/sensegram/train.py", line 79, in main
    detect_bigrams=args.bigrams, phrases_fpath=args.phrases)
  File "/content/sensegram/word_embeddings.py", line 213, in learn_word_embeddings
    iter=iter_num)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 767, in __init__
    fast_version=FAST_VERSION)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 892, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

It also seems like the model is reading through all the sentences. Just before the traceback it shows:

2019-10-09 16:41:17,609 : INFO : collected 1000706 word types from a corpus of 1000707 raw words and 1000708 sentences
2019-10-09 16:41:17,609 : INFO : Loading a fresh vocabulary
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 retains 0 unique words (0% of original 1000706, drops 1000706)
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 leaves 0 word corpus (0% of original 1000707, drops 1000707)
2019-10-09 16:41:17,917 : INFO : deleting the raw counts dictionary of 1000706 items
2019-10-09 16:41:17,938 : INFO : sample=0.001 downsamples 0 most-common words
2019-10-09 16:41:17,938 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2019-10-09 16:41:17,938 : INFO : estimated required memory for 0 words and 300 dimensions: 0 bytes
2019-10-09 16:41:17,938 : INFO : resetting layer weights
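For reference, a minimal sketch (not taken from the issue, corpus and parameter values are illustrative assumptions) that reproduces the same gensim 3.x RuntimeError: when every token occurs fewer than min_count times, the vocabulary ends up empty and training refuses to run, mirroring the "retains 0 unique words" log lines above.

```python
# Minimal sketch (assumed values): with gensim 3.x, a corpus whose tokens
# all occur fewer than min_count times leaves the vocabulary empty, and
# training then raises the same RuntimeError as in the traceback.
from gensim.models import Word2Vec

# Every token is unique, so min_count=10 retains 0 words,
# as in the "retains 0 unique words" log lines above.
sentences = [["token%d" % i] for i in range(1000)]

# Raises: RuntimeError: you must first build vocabulary before training the model
model = Word2Vec(sentences, size=300, min_count=10, iter=5)
```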