piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

RuntimeError: you must first build vocabulary before training the model #541

Closed Rahulvks closed 8 years ago

Rahulvks commented 8 years ago

I'm facing an error when importing word2vec.

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import word2vec
  File "word2vec.py", line 14, in <module>
    model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 432, in __init__
    self.train(sentences)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 690, in train
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

How do I first build the vocabulary and then train? Kindly, can anyone help?

piskvorky commented 8 years ago

Can you paste the full log, at INFO level?

Rahulvks commented 8 years ago

Import word2vec

Warning (from warnings module):
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/__init__.py", line 10
    __version__ = __import__('pkg_resources').get_distribution('gensim').version
UserWarning: Module word2vec was already imported from word2vec.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
2015-11-24 10:00:51,101: INFO : collecting all words and their counts
2015-11-24 10:00:51,102: INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences
2015-11-24 10:00:51,103: INFO : min_count=1 retains 0 unique words (drops 0)
2015-11-24 10:00:51,104: INFO : min_count leaves 0 word corpus (0% of original 0)
2015-11-24 10:00:51,104: INFO : deleting the raw counts dictionary of 0 items
2015-11-24 10:00:51,105: INFO : sample=0 downsamples 0 most-common words
2015-11-24 10:00:51,106: INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2015-11-24 10:00:51,107: INFO : estimated required memory for 0 words and 100 dimensions: 0 bytes
2015-11-24 10:00:51,108: INFO : min_count=1 retains 0 unique words (drops 0)
2015-11-24 10:00:51,109: INFO : min_count leaves 0 word corpus (0% of original 0)
2015-11-24 10:00:51,110: INFO : deleting the raw counts dictionary of 0 items
2015-11-24 10:00:51,111: INFO : sample=0 downsamples 0 most-common words
2015-11-24 10:00:51,113: INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2015-11-24 10:00:51,115: INFO : estimated required memory for 0 words and 100 dimensions: 0 bytes
2015-11-24 10:00:51,118: INFO : constructing a huffman tree from 0 words
2015-11-24 10:00:51,120: INFO : resetting layer weights
2015-11-24 10:00:51,121: INFO : training model with 4 workers on 0 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0

Traceback (most recent call last):
  File "<pyshell#0>", line 1, in <module>
    import word2vec
  File "word2vec.py", line 14, in <module>
    model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 432, in __init__
    self.train(sentences)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 690, in train
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

piskvorky commented 8 years ago

Looks like your sentences variable contains no data at all, hence the error. Check your input iterator, and read the word2vec tutorial here.

The error doesn't seem to be connected to "importing gensim" either -- I'll change the issue's title.
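
For reference, a minimal sketch of a sentences input that is not empty (the corpus contents and the MyCorpus class are illustrative, not from this thread): gensim expects an iterable of token lists, and the iterable must be restartable because the model iterates over it once to build the vocabulary and again to train.

import gensim

# the simplest valid input: an in-memory list of token lists
sentences = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system']]

# or stream from a (hypothetical) file with one whitespace-tokenized sentence per line
class MyCorpus(object):
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # re-opened on every pass, so the model can iterate more than once
        with open(self.path) as fin:
            for line in fin:
                yield line.split()

model = gensim.models.Word2Vec(sentences, min_count=1)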

Rahulvks commented 8 years ago

Hi sir, thanks. I corrected the vocab error, but:

import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['new', 'year'], ['old', 'year']]
model = gensim.models.Word2Vec(sentences, min_count=1)

2015-11-24 12:06:30,852 : INFO : expecting 2 examples, matching count from corpus used for vocabulary survey
2015-11-24 12:06:30,854 : INFO : reached end of input; waiting to finish 1 outstanding jobs
2015-11-24 12:06:30,855 : INFO : training on 4 raw words took 0.0s, 2749 trained words/s

model.build_vocab(sentences)

Error:
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 632, in sort_vocab
    raise RuntimeError("must sort before initializing vectors/weights")
RuntimeError: must sort before initializing vectors/weights

How do I fix this RuntimeError: must sort before initializing vectors/weights?

gojomo commented 8 years ago

If you supply sentences at class-initialization, then it automatically does both the build_vocab() step and the train() step. There's no reason to call build_vocab() again.

chmodsss commented 8 years ago

@gojomo: If I am doing build_vocab() and train() in two steps, is it possible to call build_vocab() multiple times on different datasets and then train the whole model? I get the same RuntimeError.
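
For what it's worth, later gensim releases (roughly 1.0 and newer) expose an update=True flag on build_vocab() for expanding an existing vocabulary with a second corpus; whether that fits this use case is a judgment call. A hedged sketch, assuming such a version and made-up toy corpora:

from gensim.models import Word2Vec

first_corpus = [['new', 'year'], ['old', 'year']]
second_corpus = [['first', 'sentence'], ['second', 'sentence']]

model = Word2Vec(min_count=1)
model.build_vocab(first_corpus)                 # initial vocabulary
model.build_vocab(second_corpus, update=True)   # expand it with a second dataset
model.train(second_corpus, total_examples=model.corpus_count, epochs=5)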

martianmartian commented 7 years ago

So is this the only way you can do it?

# import modules & set up logging
import gensim, logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model.wv['first'])
chinmayapancholi13 commented 7 years ago

@martianmartian You can initialize the class, build the vocabulary and train the model in 3 different steps as follows :

from gensim.models import word2vec as w2v

sentences = [['first', 'sentence'], ['second', 'sentence']]

model = w2v.Word2Vec(min_count=1)
model.build_vocab(sentences)
model.train(sentences)
print model.wv['first']

Otherwise, all 3 steps are performed one after another if you specify sentences when initializing the class (as in your code snippet above).
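
For readers on a more recent gensim (3.x/4.x), a separately-called train() also needs explicit counts; a minimal sketch of the same three steps under that assumption:

from gensim.models import Word2Vec

sentences = [['first', 'sentence'], ['second', 'sentence']]

model = Word2Vec(min_count=1)
model.build_vocab(sentences)
# newer gensim requires the corpus size and epoch count when train() is called on its own
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)
print(model.wv['first'])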

rachhitgarg commented 6 years ago

Please help (thanks in advance). I have processed the data and made a text file from the articles, but while training I am getting an error. Here is the code I am running, followed by the error and log I am getting:

import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)

    model.save(outp)

C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 11:07:41,728 : INFO : running train_word2vec_model.py wiki.en.text wiki.en.word2vec.model
2018-06-26 11:07:41,729 : INFO : collecting all words and their counts
2018-06-26 11:07:41,762 : INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences
2018-06-26 11:07:41,762 : INFO : Loading a fresh vocabulary
2018-06-26 11:07:41,763 : INFO : min_count=5 retains 0 unique words (0% of original 0, drops 0)
2018-06-26 11:07:41,764 : INFO : min_count=5 leaves 0 word corpus (0% of original 0, drops 0)
2018-06-26 11:07:41,764 : INFO : deleting the raw counts dictionary of 0 items
2018-06-26 11:07:41,765 : INFO : sample=0.001 downsamples 0 most-common words
2018-06-26 11:07:41,765 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2018-06-26 11:07:41,766 : INFO : estimated required memory for 0 words and 400 dimensions: 0 bytes
2018-06-26 11:07:41,767 : INFO : resetting layer weights
Traceback (most recent call last):
  File "train_word2vec_model.py", line 33, in <module>
    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5, workers=multiprocessing.cpu_count())
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\word2vec.py", line 527, in __init__
    fast_version=FAST_VERSION)
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 338, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\word2vec.py", line 611, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 569, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 241, in train
    total_words=total_words, **kwargs)
  File "C:\Program Files\Python36-32\lib\site-packages\gensim\models\base_any2vec.py", line 601, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

menshikh-iv commented 6 years ago

@rachhitgarg change min_count=5 -> min_count=1, this will help

rachhitgarg commented 6 years ago

@menshikh-iv Thanks for the help, but even with that change it is not working. I am confused about what to do.

rachhitgarg commented 6 years ago

@RVKRM were you able to resolve the vocab error RuntimeError: you must first build vocabulary before training the model?

Please help me as well.

menshikh-iv commented 6 years ago

@rachhitgarg share your inp file please

piskvorky commented 6 years ago

@rachhitgarg this line in your log, 2018-06-26 11:07:41,762 : INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences, suggests your training corpus is empty! No sentences at all.

Double check your inp file, and maybe print a few sentences from LineSentence(inp), to make sure it contains what you think it contains.
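
For example, a quick sanity check along those lines (a sketch; the wiki.en.text path stands in for the input file from the log above):

from itertools import islice

from gensim.models.word2vec import LineSentence

inp = "wiki.en.text"  # the file passed as the first command-line argument
for sentence in islice(LineSentence(inp), 5):
    # each item should be a non-empty list of tokens; [] means the line was blank
    print(sentence)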

rachhitgarg commented 6 years ago

@menshikh-iv

@piskvorky The input data is a wiki article dump.

wiki.en.zip (the file after processing the wiki articles, 167 MB of data)

I created wiki.en.text using process_wiki.py: process_wiki.txt

train_word2vec_model.txt

menshikh-iv commented 6 years ago

@rachhitgarg your file wiki.en.text has no content at all (only spaces).

rachhitgarg commented 6 years ago

@menshikh-iv Thanks a ton :+1: But why am I getting no content in wiki.en.text? I processed it with process_wiki.py. Is there a mistake in that file? I only get a 40 KB file after processing, and during execution it even reports that the articles were saved.

menshikh-iv commented 6 years ago

@rachhitgarg the reason is probably in

if six.PY3:
    output.write(str(b' ', 'utf-8') + '\n')

you are writing only spaces, that's all.
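
A hedged sketch of what that branch presumably meant to write instead, joining the actual tokens rather than a lone space (the text and output values here are hypothetical stand-ins for the variables in process_wiki.py, and output is assumed to be opened in binary mode):

import six

text = ['example', 'tokens', 'from', 'one', 'article']   # hypothetical tokens from get_texts()
output = open('wiki.en.text', 'wb')                       # opened in binary mode

if six.PY3:
    # encode the joined tokens to UTF-8 bytes before writing
    output.write(' '.join(text).encode('utf-8') + b'\n')

output.close()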

rachhitgarg commented 6 years ago

@menshikh-iv What do you suggest? I have tried

if six.PY3:
    output.write(b' '.join(text).decode('utf-8') + '\n')

but it shows an error:

File "process_wiki.py", line 30, in <module>
    output.write(b' '.join(text).decode('utf-8') + '\n')
TypeError: sequence item 0: expected a bytes-like object, str found

menshikh-iv commented 6 years ago

@rachhitgarg remove b from b' '.join(...)

rachhitgarg commented 6 years ago

@menshikh-iv I am getting a similar error:

Traceback (most recent call last):
  File "process_wiki.py", line 37, in <module>
    output.write(' '.join(text).decode('utf-8') + '\n')
AttributeError: 'str' object has no attribute 'decode'

I have also tried removing .decode('utf-8') and get this error:

C:\Users\Neha>python process_wiki.py wiki.xml.bz2 wiki.en.text
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 17:42:14,117: INFO: running process_wiki.py wiki.xml.bz2 wiki.en.text

C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 17:42:15,629: INFO: adding document #0 to Dictionary(0 unique tokens: [])
2018-06-26 17:45:14,921: INFO: adding document #10000 to Dictionary(447974 unique tokens: ['abandoning', 'abandonment', 'abdelrahim', 'abdullah', 'ability']...)

2018-06-26 17:46:50,058: INFO: finished iterating over Wikipedia corpus of 14896 documents with 47207482 positions (total 19839 articles, 47228312 positions before pruning articles shorter than 50 words)
2018-06-26 17:46:50,129: INFO: built Dictionary(560404 unique tokens: ['abandoning', 'abandonment', 'abdelrahim', 'abdullah', 'ability']...) from 14896 documents (total 47207482 corpus positions)
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "process_wiki.py", line 37, in <module>
    output.write(' '.join(text) + '\n')
  File "C:\Program Files\Python36-32\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 1426-1427: character maps to <undefined>
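
The last traceback suggests the output file was opened in text mode without an explicit encoding, so Windows falls back to cp1252. A minimal sketch of one way around that, with hypothetical stand-ins for the script's variables (pin the encoding to UTF-8 when opening the file):

import io

outp = "wiki.en.text"                                     # hypothetical output path
text = ['example', 'tokens', 'from', 'one', 'article']    # hypothetical tokens from get_texts()

# io.open lets us set the encoding explicitly instead of relying on the platform default
with io.open(outp, 'w', encoding='utf-8') as output:
    output.write(u' '.join(text) + u'\n')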

rachhitgarg commented 6 years ago

@piskvorky @menshikh-iv I am not able to understand: if a small dataset already shows these kinds of errors, how can I check the wiki dumps?

Please help me create a vector file for the latest wiki article dumps (around 12 GB).

menshikh-iv commented 6 years ago

@rachhitgarg your dataset is broken; that's not a problem with the model (check the stack trace carefully, the error is not from gensim code).

rachhitgarg commented 6 years ago

@menshikh-iv First of all, I am very sorry for bothering you. You are such a nice person, replying to all my silly doubts. Thank you so, so much. Seriously, at this moment I am really looking for someone to help me. Thanks once again.

Let me tell you that I am just using the wiki article dump downloaded from Wikipedia:

enwiki-latest-pages-article.xml.bz2

rachhitgarg commented 6 years ago

@menshikh-iv Did you try the code? I would request you to please run it once and let me know whether you get the expected output.

menshikh-iv commented 6 years ago

@rachhitgarg I wrote an example for you; hope this is enough:

import logging
from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec, LineSentence
from smart_open import smart_open

logging.basicConfig(level=logging.INFO)

dump = "parsed-wiki.txt"
with smart_open(dump, 'wb') as outfile:
    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})
    for idx, text in enumerate(wiki.get_texts()):
        outfile.write((" ".join(text) + "\n").encode("utf-8"))

        # break process (because we don't want to extract all wiki data, we want to show that this works)
        if idx == 1000:  
            break

model = Word2Vec(LineSentence(dump), min_count=1)
rachhitgarg commented 6 years ago

error

I tried it on a 23 MB wiki article file. @menshikh-iv I am sorry to say that even this is not working for me; execution seems to be stuck in an infinite loop. It has been running for the last 20 minutes and is showing an execution error.

menshikh-iv commented 6 years ago

Windows issues with multiprocessing (please google it yourself).

WikiCorpus uses multiprocessing internally; if I remember correctly, you need to wrap my code with

if __name__ == '__main__':
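Roughly, a sketch of the earlier example restructured with that guard (assuming the guard is the only change needed on Windows):

import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import LineSentence, Word2Vec
from smart_open import smart_open

logging.basicConfig(level=logging.INFO)


def build_corpus(dump):
    # write ~1000 articles to a plain-text file, one article per line
    with smart_open(dump, 'wb') as outfile:
        wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})
        for idx, text in enumerate(wiki.get_texts()):
            outfile.write((" ".join(text) + "\n").encode("utf-8"))
            if idx == 1000:  # stop early; this is only a demonstration
                break


if __name__ == '__main__':
    dump = "parsed-wiki.txt"
    build_corpus(dump)
    model = Word2Vec(LineSentence(dump), min_count=1)
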
piskvorky commented 6 years ago

This is really a bad place for such support and open (unrelated) discussions -- please use the Gensim mailing list.

rachhitgarg commented 6 years ago

@menshikh-iv Thank you so much, you helped me in a very nice way. Thanks a ton!

Problem solved by editing the process_wiki file and changing the code that saves the bytes to the file :)

monterga commented 5 years ago

@rachhitgarg I have the same problem as you. Could you help me fix it? Thanks so much.

susmithadachu commented 4 years ago

I have a problem, please help:

2020-04-15 10:42:17,848 : INFO : collecting all words and their counts
2020-04-15 10:42:17,864 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-15 10:42:17,864 : INFO : collected 10 word types from a corpus of 10 raw words and 10 sentences
2020-04-15 10:42:17,864 : INFO : Loading a fresh vocabulary
2020-04-15 10:42:17,864 : INFO : effective_min_count=50 retains 0 unique words (0% of original 10, drops 10)
2020-04-15 10:42:17,864 : INFO : effective_min_count=50 leaves 0 word corpus (0% of original 10, drops 10)
2020-04-15 10:42:17,864 : INFO : deleting the raw counts dictionary of 10 items
2020-04-15 10:42:17,864 : INFO : sample=0.001 downsamples 0 most-common words
2020-04-15 10:42:17,865 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2020-04-15 10:42:17,865 : INFO : estimated required memory for 0 words and 500 dimensions: 0 bytes
2020-04-15 10:42:17,865 : INFO : resetting layer weights
Traceback (most recent call last):
  File "/home/anu/PycharmProjects/untitled/train_word2vec_model.py", line 78, in <module>
    model = train_embeddings(options.input_corpus_path)
  File "/home/anu/PycharmProjects/untitled/train_word2vec_model.py", line 31, in train_embeddings
    workers=options.n_threads,
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 783, in __init__
    fast_version=FAST_VERSION)
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 763, in __init__
    end_alpha=self.min_alpha, compute_loss=compute_loss)
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/word2vec.py", line 910, in train
    queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 1081, in train
    **kwargs)
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 536, in train
    total_words=total_words, **kwargs)
  File "/home/anu/.local/lib/python2.7/site-packages/gensim/models/base_any2vec.py", line 1187, in _check_training_sanity
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model

piskvorky commented 4 years ago

@susmithadachu your corpus only has 10 words, and you're filtering them all out. Read your log carefully.
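
In other words, min_count (effective_min_count in the log) discards every word that occurs fewer than that many times in the whole corpus, so a 10-word corpus with min_count=50 ends up with an empty vocabulary. A toy sketch of the difference (the corpus here is made up):

from gensim.models import Word2Vec

# ten one-word sentences, mirroring the "10 raw words and 10 sentences" in the log
sentences = [['alpha'], ['bravo'], ['charlie'], ['delta'], ['echo'],
             ['foxtrot'], ['golf'], ['hotel'], ['india'], ['juliet']]

# min_count=50 would drop all 10 word types and raise the same RuntimeError;
# with min_count=1 every word survives and training can proceed
model = Word2Vec(sentences, min_count=1)
vector = model.wv['alpha']  # works now that the word passed the min_count filter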