I ran the code to convert word embeddings to sense embeddings. When I tried using the files "wiki.txt.word_vectors" and "ukwac.txt.word_vectors", I received the error "you must first build vocabulary before training the model." Below is the traceback (for ukwac.txt.word_vectors).
Traceback (most recent call last):
  File "/content/sensegram/train.py", line 114, in <module>
    main()
  File "/content/sensegram/train.py", line 79, in main
    detect_bigrams=args.bigrams, phrases_fpath=args.phrases)
  File "/content/sensegram/word_embeddings.py", line 213, in learn_word_embeddings
    iter=iter_num)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 767, in __init__
    fast_version=FAST_VERSION)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/word2vec.py", line 892, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
It also seems that the model does read through all the sentences. Just before the traceback, the log shows:
2019-10-09 16:41:17,609 : INFO : collected 1000706 word types from a corpus of 1000707 raw words and 1000708 sentences
2019-10-09 16:41:17,609 : INFO : Loading a fresh vocabulary
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 retains 0 unique words (0% of original 1000706, drops 1000706)
2019-10-09 16:41:17,917 : INFO : effective_min_count=10 leaves 0 word corpus (0% of original 1000707, drops 1000707)
2019-10-09 16:41:17,917 : INFO : deleting the raw counts dictionary of 1000706 items
2019-10-09 16:41:17,938 : INFO : sample=0.001 downsamples 0 most-common words
2019-10-09 16:41:17,938 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2019-10-09 16:41:17,938 : INFO : estimated required memory for 0 words and 300 dimensions: 0 bytes
2019-10-09 16:41:17,938 : INFO : resetting layer weights
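The log itself hints at the cause: 1000706 distinct word types were collected from only 1000707 raw words, i.e. nearly every token is unique, so the min_count=10 filter prunes the entire vocabulary and training starts with zero words. The following stdlib-only sketch (the helper name `effective_vocab` is hypothetical, not gensim's actual API) illustrates that pruning step, assuming a corpus of near-unique tokens such as a word-vectors file read as plain text:

```python
from collections import Counter

def effective_vocab(sentences, min_count=10):
    # Count raw occurrences, then keep only words seen at least
    # min_count times -- mirroring gensim's min_count pruning step.
    counts = Counter(w for sent in sentences for w in sent)
    return {w: c for w, c in counts.items() if c >= min_count}

# A corpus where almost every token is unique (like a .word_vectors
# file parsed as sentences) retains nothing:
sentences = [[f"vec_{i}", "0.1", "0.2"] for i in range(5)]
print(effective_vocab(sentences))  # → {}
```

If this is what is happening, the script is likely being pointed at an embeddings file where it expects a raw text corpus, so every "word" occurs once and the vocabulary ends up empty.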