piskvorky / gensim

Topic Modelling for Humans
https://radimrehurek.com/gensim
GNU Lesser General Public License v2.1

LDAModel stuck on "using serial LDA version on this node" #1136

Open several27 opened 7 years ago

several27 commented 7 years ago

Hi, I'm having a problem running LdaMulticore on my dataset.

posts_model = LdaMulticore(corpus, num_topics=1300, id2word=posts_corpus['id2tag'], workers=15)

My corpus size is about 2 million documents, with 0.5 million unique words. I'm running it on EC2 r4.4xlarge. When I lower the number of tags (to 100) it seems to work fine, and training usually starts within 10 minutes of the last info message. If I don't, it just gets stuck on the following (I waited more than two hours):

INFO : using symmetric alpha at 0.01
INFO : using symmetric eta at 1.957169306888
INFO : using serial LDA version on this node

What's happening after this last info message is displayed? Is there any way to debug it further? I tried using the Python debugger and it seems to be stuck at self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats)) at line 329 in the ldamodel.py file, but it might have been something else.
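For a sense of scale, here is a rough back-of-the-envelope sketch of how large the topic-word statistics matrix behind that line would be. It assumes the matrix has shape (num_topics, num_terms) and stores 8-byte floats; the helper name `topic_word_matrix_bytes` is made up for illustration and is not part of gensim. This only estimates size and does not by itself explain the hang.

```python
def topic_word_matrix_bytes(num_topics, num_terms, bytes_per_value=8):
    """Estimate the memory of one dense (num_topics x num_terms) matrix.

    bytes_per_value=8 assumes float64 storage, which is an assumption here.
    """
    return num_topics * num_terms * bytes_per_value


# Figures from this report: 1300 topics, about 0.5 million unique words.
size = topic_word_matrix_bytes(1300, 500_000)
print(f"{size / 1e9:.1f} GB per copy of the matrix")

# The configuration that reportedly worked: 100 topics, same vocabulary.
small = topic_word_matrix_bytes(100, 500_000)
print(f"{small / 1e9:.1f} GB per copy of the matrix")
```

Any step that computes over the whole matrix or copies it (for example, sending state to worker processes) pays a cost proportional to this size, so shrinking the vocabulary or the number of topics reduces it linearly.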

Also, when I terminated the script after two hours, I got at least a thousand messages like:

ValueError: semaphore or lock released too many times

bfolkens commented 7 years ago

Same here: workers=40, num_topics=10000 Corpus size: 112M

several27 commented 7 years ago

@bfolkens I solved this problem by using a different library specifically for LDA, called LightLDA. I know it's not a perfect solution, but it's easy to convert data to it from Python, and it's written in C++, so no weird Python multiprocessing issues etc. https://github.com/Microsoft/LightLDA

bfolkens commented 7 years ago

@several27 Wow, this looks promising - thanks!

pranaydeeps commented 7 years ago

@tmylk Reproduced. Looking into this.

ShT3ch commented 7 years ago

@pranaydeep-af @several27, I am trying to fix it now. Please add some extra info: which OS, Python version, and gensim version are you using?

koustuvsinha commented 7 years ago

@ShT3ch this happens every time I try to run LdaMulticore on any large dataset. OS: Linux, Python 2, latest gensim version. For a dataset of 100,000 documents, where each document consists of more than 10,000 words, this step takes about 45 minutes on an 8-core Intel Xeon 3.5 GHz processor. From my initial chats with people in the gensim Gitter channel, this is due to some pre-calculation which is not properly implemented.

MaxDesiatov commented 7 years ago

reproducible for me on macOS 10.12.6, numpy 1.13.1, scipy 0.19.1, gensim 2.3.0, Cython 0.26, info output:

INFO:gensim.models.ldamodel:using symmetric alpha at 0.02
INFO:gensim.models.ldamodel:using symmetric eta at 1.66656945011541e-06
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamulticore:running online LDA training, 50 topics, 1 passes over the supplied corpus of 62552 documents, updating every 6000 documents, evaluating every ~60000 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamulticore:training LDA model using 3 processes

While it does spread some work across 3 processes initially, that quickly ends and most of the time only one process is running.

piskvorky commented 7 years ago

@menshikh-iv @ShT3ch @pranaydeep-af this looks critical -- what's the status here?

menshikh-iv commented 7 years ago

@several27 @explicitcall please upload your corpus to any storage and share a link with us, and attach the exact code that gets stuck. We need both to reproduce the problem.

menshikh-iv commented 7 years ago

ping @several27 @explicitcall, please provide more info for reproducing the problem (corpus, code, etc.).

several27 commented 7 years ago

@menshikh-iv hey, I was quiet on the issue because, as stated above, I switched tools. But I’d be happy to help resolve the issue. The thing is that I’m not sure how easy that is, as the problem here is with bigger data. If I recall correctly, you were using pickling to share data between processes? That’s one thing that kept failing for me: the limit of 3 GB. As for the issue itself, I’ve provided the code in the first comment above, and I can’t share the dataset. But anything with more than 2M documents and 0.5M words would do!

menshikh-iv commented 7 years ago

@several27 thanks, what corpus format did you use?

several27 commented 7 years ago

@menshikh-iv If I recall correctly just a list of documents, where each document is a list of (word_id, word_frequency) tuples.
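As a minimal sketch of the format described above, the snippet below builds a tiny corpus as a list of documents, each a list of (word_id, word_frequency) tuples. The vocabulary, the tokens, and the helper `to_bow` are invented for illustration; the real dataset in this issue was not shared.

```python
from collections import Counter


def to_bow(tokens, token2id):
    """Turn a list of tokens into a sorted list of (word_id, frequency) tuples.

    Tokens missing from token2id are skipped.
    """
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Hypothetical vocabulary and documents, for illustration only.
token2id = {"lda": 0, "topic": 1, "model": 2}
documents = [
    ["lda", "topic", "topic"],
    ["model", "lda", "unknown"],
]

corpus = [to_bow(doc, token2id) for doc in documents]
print(corpus)
```

A corpus shaped like this can be passed as the first argument to LdaMulticore, as in the call from the first comment.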