Closed Rahulvks closed 8 years ago
Can you paste the full log, at INFO level?
Import word2vec
Warning (from warnings module):
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/__init__.py", line 10
    __version__ = __import__('pkg_resources').get_distribution('gensim').version
UserWarning: Module word2vec was already imported from word2vec.pyc, but /usr/local/lib/python2.7/dist-packages is being added to sys.path
2015-11-24 10:00:51,101: INFO : collecting all words and their counts
2015-11-24 10:00:51,102: INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences
2015-11-24 10:00:51,103: INFO : min_count=1 retains 0 unique words (drops 0)
2015-11-24 10:00:51,104: INFO : min_count leaves 0 word corpus (0% of original 0)
2015-11-24 10:00:51,104: INFO : deleting the raw counts dictionary of 0 items
2015-11-24 10:00:51,105: INFO : sample=0 downsamples 0 most-common words
2015-11-24 10:00:51,106: INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2015-11-24 10:00:51,107: INFO : estimated required memory for 0 words and 100 dimensions: 0 bytes
2015-11-24 10:00:51,108: INFO : min_count=1 retains 0 unique words (drops 0)
2015-11-24 10:00:51,109: INFO : min_count leaves 0 word corpus (0% of original 0)
2015-11-24 10:00:51,110: INFO : deleting the raw counts dictionary of 0 items
2015-11-24 10:00:51,111: INFO : sample=0 downsamples 0 most-common words
2015-11-24 10:00:51,113: INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2015-11-24 10:00:51,115: INFO : estimated required memory for 0 words and 100 dimensions: 0 bytes
2015-11-24 10:00:51,118: INFO : constructing a huffman tree from 0 words
2015-11-24 10:00:51,120: INFO : resetting layer weights
2015-11-24 10:00:51,121: INFO : training model with 4 workers on 0 vocabulary and 100 features, using sg=1 hs=1 sample=0 and negative=0
Traceback (most recent call last):
File "<pyshell#0>", line 1, in
Looks like your sentences contains no data at all, hence the error. Check your input iterator. Read the word2vec tutorial here.
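As an illustration of that check (a minimal pure-Python sketch, not gensim code; the check_corpus helper is hypothetical): Word2Vec expects an iterable of tokenized sentences, i.e. lists of strings, and an empty iterable produces exactly the "0 word types ... 0 sentences" log above.

```python
def check_corpus(sentences):
    """Count the sentences and tokens an iterator actually yields."""
    n_sentences = 0
    n_tokens = 0
    for sentence in sentences:
        n_sentences += 1
        n_tokens += len(sentence)
    return n_sentences, n_tokens

good = [['new', 'year'], ['old', 'year']]
bad = []  # an exhausted generator or empty file behaves like this

print(check_corpus(good))  # (2, 4)
print(check_corpus(bad))   # (0, 0) -> leads to the RuntimeError
```

Note that a generator is exhausted after one pass, so if you iterate over it once before training, Word2Vec will see nothing.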
The error doesn't seem to be connected to "importing gensim" either -- I'll change the issue's title.
Hi sir, thanks. I corrected the vocab error, but:
import gensim, logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['new', 'year'], ['old', 'year']]
model = gensim.models.Word2Vec(sentences, min_count=1)
2015-11-24 12:06:30,852 : INFO : expecting 2 examples, matching count from corpus used for vocabulary survey
2015-11-24 12:06:30,854 : INFO : reached end of input; waiting to finish 1 outstanding jobs
2015-11-24 12:06:30,855 : INFO : training on 4 raw words took 0.0s, 2749 trained words/s
model.build_vocab(sentences)
Error:
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 632, in sort_vocab
    raise RuntimeError("must sort before initializing vectors/weights")
RuntimeError: must sort before initializing vectors/weights
How do I fix this "RuntimeError: must sort before initializing vectors/weights" error?
If you supply sentences at class-initialization, then it automatically does both the build_vocab() step and the train() step. There's no reason to call build_vocab() again.
@gojomo: If I do build_vocab() and train() in two separate steps, is it possible to call build_vocab() multiple times on different datasets and then train the whole model? I get the same RuntimeError.
so can you only do this?
# import modules & set up logging
import gensim, logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
sentences = [['first', 'sentence'], ['second', 'sentence']]
# train word2vec on the two sentences
model = gensim.models.Word2Vec(sentences, min_count=1)
print(model.wv['first'])
@martianmartian You can initialize the class, build the vocabulary and train the model in 3 separate steps as follows:
from gensim.models import word2vec as w2v

sentences = [['first', 'sentence'], ['second', 'sentence']]
model = w2v.Word2Vec(min_count=1)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.epochs)  # total_examples and epochs are required in gensim >= 1.0
print(model.wv['first'])
Otherwise, all 3 steps are performed one after another if you supply sentences at the time the class is initialized (as in your code snippet above).
Please help (thanks in advance). I have processed the data and made a text file for the articles, but while training I am getting an error. Here is the code I am running, followed by the error and log:
import logging
import os.path
import sys
import multiprocessing

from gensim.corpora import WikiCorpus
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)
    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print(globals()['__doc__'] % locals())
        sys.exit(1)
    inp, outp = sys.argv[1:3]

    model = Word2Vec(LineSentence(inp), size=400, window=5, min_count=5,
                     workers=multiprocessing.cpu_count())

    # trim unneeded model memory = use (much) less RAM
    model.init_sims(replace=True)
    model.save(outp)
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 11:07:41,728 : INFO : running train_word2vec_model.py wiki.en.text wiki.en.word2vec.model
2018-06-26 11:07:41,729 : INFO : collecting all words and their counts
2018-06-26 11:07:41,762 : INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences
2018-06-26 11:07:41,762 : INFO : Loading a fresh vocabulary
2018-06-26 11:07:41,763 : INFO : min_count=5 retains 0 unique words (0% of original 0, drops 0)
2018-06-26 11:07:41,764 : INFO : min_count=5 leaves 0 word corpus (0% of original 0, drops 0)
2018-06-26 11:07:41,764 : INFO : deleting the raw counts dictionary of 0 items
2018-06-26 11:07:41,765 : INFO : sample=0.001 downsamples 0 most-common words
2018-06-26 11:07:41,765 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2018-06-26 11:07:41,766 : INFO : estimated required memory for 0 words and 400 dimensions: 0 bytes
2018-06-26 11:07:41,767 : INFO : resetting layer weights
Traceback (most recent call last):
  File "train_word2vec_model.py", line 33, in
@rachhitgarg change min_count=5 -> min_count=1, this will help
@menshikh-iv Thanks for the help, but even with that change it is not working. I am confused about what to do.
@RVKRM were you able to resolve the vocab error "RuntimeError: you must first build vocabulary before training the model"? Please help me too.
@rachhitgarg share your inp file please
@rachhitgarg this line in your log:
2018-06-26 11:07:41,762 : INFO : collected 0 word types from a corpus of 0 raw words and 0 sentences
suggests your training corpus is empty! No sentences at all. Double check your inp file, and maybe print a few sentences from LineSentence(inp), to make sure it contains what you think it contains.
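A quick way to do that check, as a sketch (line_sentences here is a hypothetical pure-Python stand-in for gensim's LineSentence, which likewise reads one whitespace-tokenized sentence per line; sample.txt is a made-up file written inline so the snippet is self-contained):

```python
import itertools

def line_sentences(path):
    """Yield one tokenized sentence per line of a text file."""
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            yield line.split()

# Write a tiny sample corpus so the sketch runs on its own.
with open('sample.txt', 'w', encoding='utf-8') as fh:
    fh.write('anarchism is a political philosophy\n')
    fh.write('it rejects unjust hierarchy\n')

# Print the first few sentences; a healthy corpus shows token lists,
# an all-whitespace file shows only empty lists.
for sentence in itertools.islice(line_sentences('sample.txt'), 5):
    print(sentence)
```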
@menshikh-iv
@piskvorky input data is wiki article
wiki.en.zip (file after processing wiki article 167MB data )
I created wiki.en.text using (process_wiki.py) process_wiki.txt
@rachhitgarg your file wiki.en.text has no content at all (only spaces)
@menshikh-iv Thanks a ton :+1: But why am I getting no content in wiki.en.text? I processed it with process_wiki.py. Is there a mistake in that file? I am getting a 40 KB file after processing, and during execution it even reports how many articles were saved.
@rachhitgarg the reason is probably this line:
if six.PY3:
    output.write(str(b' ', 'utf-8') + '\n')
You write only spaces, that's all.
@menshikh-iv What do you suggest? I have tried
if six.PY3:
    output.write(b' '.join(text).decode('utf-8') + '\n')
but it is showing this error:
File "process_wiki.py", line 30, in
    output.write(b' '.join(text).decode('utf-8') + '\n')
TypeError: sequence item 0: expected a bytes-like object, str found
@rachhitgarg remove the b from b' '.join(...)
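A sketch of why that fixes it (plain Python; the sample tokens are made up): in Python 3, WikiCorpus.get_texts() yields str tokens, so joining them with a bytes separator raises exactly the TypeError above, while a str separator works. Encode the joined line only if the output file is opened in binary mode.

```python
text = ['anarchism', 'is', 'a', 'political', 'philosophy']  # made-up sample tokens

try:
    b' '.join(text)              # bytes separator + str items -> TypeError
except TypeError as err:
    print(err)                   # sequence item 0: expected a bytes-like object, str found

line = ' '.join(text) + '\n'     # str separator works
with open('wiki.en.text', 'w', encoding='utf-8') as output:
    output.write(line)           # text mode: write the str directly
```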
@menshikh-iv I am getting the same error:
Traceback (most recent call last):
  File "process_wiki.py", line 37, in
I have also tried removing .decode('utf-8') and still get an error:
C:\Users\Neha>python process_wiki.py wiki.xml.bz2 wiki.en.text
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 17:42:14,117: INFO: running process_wiki.py wiki.xml.bz2 wiki.en.text
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2018-06-26 17:42:15,629: INFO: adding document #0 to Dictionary(0 unique tokens: [])
2018-06-26 17:45:14,921: INFO: adding document #10000 to Dictionary(447974 unique tokens: ['abandoning', 'abandonment', 'abdelrahim', 'abdullah', 'ability']...)
2018-06-26 17:46:50,058: INFO: finished iterating over Wikipedia corpus of 14896 documents with 47207482 positions (total 19839 articles, 47228312 positions before pruning articles shorter than 50 words)
2018-06-26 17:46:50,129: INFO: built Dictionary(560404 unique tokens: ['abandoning', 'abandonment', 'abdelrahim', 'abdullah', 'ability']...) from 14896 documents (total 47207482 corpus positions)
C:\Program Files\Python36-32\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
Traceback (most recent call last):
  File "process_wiki.py", line 37, in
@piskvorky @menshikh-iv I am not able to understand: if a small dataset shows these kinds of errors, how can I check the wiki dumps? Please help me create a vector file for the latest wiki dump of articles (around 12 GB).
@rachhitgarg your dataset is broken; that's not a problem with the model (check the stack trace carefully, the error is not from gensim code).
@menshikh-iv First of all, I am very sorry for bothering you. You are such a nice person, replying to all my silly doubts. Thank you so, so much; at this moment I am really looking for someone to help me. Thanks once again.
Let me tell you that I am just using the wiki article dump downloaded from Wikipedia:
enwiki-latest-pages-article.xml.bz2
@menshikh-iv did you try the code? I would request you to please run it once and let me know if you get the expected output.
@rachhitgarg I wrote example for you, hope that this is enough
import logging

from gensim.corpora import WikiCorpus
from gensim.models.word2vec import Word2Vec, LineSentence
from smart_open import smart_open

logging.basicConfig(level=logging.INFO)

dump = "parsed-wiki.txt"

with smart_open(dump, 'wb') as outfile:
    wiki = WikiCorpus("enwiki-latest-pages-articles.xml.bz2", lemmatize=False, dictionary={})
    for idx, text in enumerate(wiki.get_texts()):
        outfile.write((" ".join(text) + "\n").encode("utf-8"))
        # break early (because we don't want to extract all wiki data, we want to show that this works)
        if idx == 1000:
            break

model = Word2Vec(LineSentence(dump), min_count=1)
@menshikh-iv I tried it on a 23 MB wiki article file. I am sorry to say even this is not working for me; it seems to be in an infinite loop. It has been running for the last 20 minutes and is still executing, showing an execution error.
Windows has issues with multiprocessing (please google it yourself). WikiCorpus uses multiprocessing internally; if I remember correctly, you need to wrap my code with
if __name__ == '__main__':
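The shape of that wrapper, demonstrated with a toy multiprocessing job instead of WikiCorpus (the worker/main names are mine): on Windows, child processes re-import the main module, so any code that spawns workers must sit behind the guard or it will be re-executed recursively in every child.

```python
import multiprocessing

def worker(x):
    """Trivial job to run in a worker process."""
    return x * x

def main():
    # Spawning the pool inside main(), behind the guard below, is what
    # keeps Windows from re-running this code in each child process.
    with multiprocessing.Pool(2) as pool:
        print(pool.map(worker, [1, 2, 3]))  # [1, 4, 9]

if __name__ == '__main__':
    main()
```

The same pattern applies to the WikiCorpus snippet above: move its body into a function and call it under the guard.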
This is really a bad place for such support and open (unrelated) discussions -- please use the Gensim mailing list.
@menshikh-iv Thank you so much, you helped me in a very nice way. Thanks a ton.
The problem was solved by editing the process_wiki file and changing the code that saves the byte file :)
@rachhitgarg I have the same problem as you. Could you help me fix it? Thanks so much.
I have a problem, please help:
2020-04-15 10:42:17,848 : INFO : collecting all words and their counts
2020-04-15 10:42:17,864 : INFO : PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
2020-04-15 10:42:17,864 : INFO : collected 10 word types from a corpus of 10 raw words and 10 sentences
2020-04-15 10:42:17,864 : INFO : Loading a fresh vocabulary
2020-04-15 10:42:17,864 : INFO : effective_min_count=50 retains 0 unique words (0% of original 10, drops 10)
2020-04-15 10:42:17,864 : INFO : effective_min_count=50 leaves 0 word corpus (0% of original 10, drops 10)
2020-04-15 10:42:17,864 : INFO : deleting the raw counts dictionary of 10 items
2020-04-15 10:42:17,864 : INFO : sample=0.001 downsamples 0 most-common words
2020-04-15 10:42:17,865 : INFO : downsampling leaves estimated 0 word corpus (0.0% of prior 0)
2020-04-15 10:42:17,865 : INFO : estimated required memory for 0 words and 500 dimensions: 0 bytes
2020-04-15 10:42:17,865 : INFO : resetting layer weights
Traceback (most recent call last):
File "/home/anu/PycharmProjects/untitled/train_word2vec_model.py", line 78, in
@susmithadachu your corpus only has 10 words, and you're filtering them all out. Read your log carefully.
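What that log is saying, as a toy sketch (pure Python; the word names are invented to mimic the log's 10 sentences, 10 raw words, 10 unique types): with effective_min_count=50 and every word occurring only once, the frequency filter drops everything, leaving an empty vocabulary.

```python
from collections import Counter

# 10 one-word sentences, mimicking the log: 10 raw words, 10 word types.
sentences = [['word%d' % i] for i in range(10)]
counts = Counter(w for s in sentences for w in s)

min_count = 50
retained = [w for w, c in counts.items() if c >= min_count]
print(len(retained))  # 0 -> "retains 0 unique words", hence the error

retained = [w for w, c in counts.items() if c >= 1]  # min_count=1 keeps all
print(len(retained))  # 10
```

The fix is either to lower min_count or, better, to feed in a corpus large enough that real words clear the threshold.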
Facing error when importing Word2vec
Traceback (most recent call last):
  File "<pyshell#0>", line 1, in
    import word2vec
  File "word2vec.py", line 14, in
    model = word2vec.Word2Vec(sentences, size=100, window=4, min_count=1, workers=4)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 432, in __init__
    self.train(sentences)
  File "/usr/local/lib/python2.7/dist-packages/gensim-0.12.3-py2.7-linux-x86_64.egg/gensim/models/word2vec.py", line 690, in train
    raise RuntimeError("you must first build vocabulary before training the model")
RuntimeError: you must first build vocabulary before training the model
How do I build the vocabulary first and then train? Kindly, anyone, help.