Open several27 opened 7 years ago
Same here: workers=40, num_topics=10000 Corpus size: 112M
@bfolkens I solved this problem by using a different library specifically for LDA, called LightLDA. I know it's not a perfect solution, but it's easy to convert data to it from Python, and it's written in C++, so there are no weird Python multiprocessing issues etc. https://github.com/Microsoft/LightLDA
@several27 Wow, this looks promising - thanks!
@tmylk Reproduced. Looking into this.
@pranaydeep-af @several27, I am trying to fix it now. Please add some extra meta info: which OS, Python version, and gensim version are you using?
@ShT3ch this happens every time I try to run LdaMulticore on any large dataset. OS: Linux, Python 2, latest gensim version. For a dataset of 100,000 documents, where each document consists of more than 10,000 words, this step takes about 45 minutes on an 8-core Intel Xeon 3.5 GHz processor. From my initial chats with people in the gensim Gitter channel, this is due to some pre-calculation that is not properly implemented.
reproducible for me on macOS 10.12.6, numpy 1.13.1, scipy 0.19.1, gensim 2.3.0, Cython 0.26, info output:
INFO:gensim.models.ldamodel:using symmetric alpha at 0.02
INFO:gensim.models.ldamodel:using symmetric eta at 1.66656945011541e-06
INFO:gensim.models.ldamodel:using serial LDA version on this node
INFO:gensim.models.ldamulticore:running online LDA training, 50 topics, 1 passes over the supplied corpus of 62552 documents, updating every 6000 documents, evaluating every ~60000 documents, iterating 50x with a convergence threshold of 0.001000
INFO:gensim.models.ldamulticore:training LDA model using 3 processes
While it does spread some work across 3 processes initially, that quickly ends and most of the time only one process is running.
@menshikh-iv @ShT3ch @pranaydeep-af this looks critical -- what's the status here?
@several27 @explicitcall please upload your corpus to any storage and share a link with us, and attach the exact code that gets stuck. We really need both to reproduce the problem.
ping @several27 @explicitcall, please provide more info for reproducing the problem (corpus, code, etc).
@menshikh-iv hey, I was quiet on the issue because, as stated above, I switched tools. But I’d be happy to help resolve the issue. The thing is that I’m not sure how easy it is, as the problem here is with bigger data. If I recall correctly, you were using pickling to share data between processes? That’s one thing that kept failing for me: the limit of 3 GB. As for the issue itself, I’ve provided the code in the first comment above and I can’t share the dataset. But anything with more than 2M documents and 0.5M words would do!
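To put the sizes in this comment in perspective, here is a rough back-of-the-envelope sketch. It assumes the per-topic word statistics are held as a dense num_topics x num_terms array of 8-byte floats; that layout and the element size are assumptions for illustration, not something confirmed in this thread. The helper name topic_word_matrix_bytes is hypothetical, and the 10,000-topic case is taken from another report above.

```python
def topic_word_matrix_bytes(num_topics, num_terms, bytes_per_value=8):
    """Size in bytes of a dense (num_topics x num_terms) array.

    bytes_per_value=8 assumes 64-bit floats; this is an assumption,
    the actual dtype may differ.
    """
    return num_topics * num_terms * bytes_per_value


GIB = 1024 ** 3
NUM_TERMS = 500_000  # "0.5M words" from the report above

for num_topics in (100, 10_000):
    size = topic_word_matrix_bytes(num_topics, NUM_TERMS)
    print("{} topics: {:.1f} GiB".format(num_topics, size / GIB))
```

Under these assumptions, a single copy of such an array with thousands of topics would already be well beyond a 3 GB serialization limit, which may explain why lowering the topic count makes the problem disappear.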
@several27 thanks, what corpus format did you use?
@menshikh-iv If I recall correctly just a list of documents, where each document is a list of (word_id, word_frequency) tuples.
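Since the original dataset cannot be shared, a synthetic corpus in the same shape might help with reproduction. The sketch below builds a list of documents, each a list of (word_id, word_frequency) tuples, as described above. The function name make_synthetic_corpus, the uniform word sampling, and the frequency range 1 to 5 are all illustrative assumptions; real text has a very different word distribution.

```python
import random


def make_synthetic_corpus(num_docs, vocab_size, words_per_doc, seed=0):
    """Build a random bag-of-words corpus.

    Returns a list of documents; each document is a list of
    (word_id, word_frequency) tuples sorted by word_id.
    """
    rng = random.Random(seed)
    corpus = []
    for _ in range(num_docs):
        # Pick distinct word ids so no id repeats within a document.
        word_ids = rng.sample(range(vocab_size), words_per_doc)
        # Attach an arbitrary frequency between 1 and 5 to each word.
        doc = sorted((word_id, rng.randint(1, 5)) for word_id in word_ids)
        corpus.append(doc)
    return corpus


# Small sizes so the example runs quickly.
corpus = make_synthetic_corpus(num_docs=1000, vocab_size=5000, words_per_doc=50)
print(corpus[0][:5])
```

To approach the reported conditions, the sizes would need to be scaled toward roughly 2M documents and 0.5M unique words, which requires correspondingly more memory.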
Hi, I'm having a problem with running LdaMulticore on my dataset. My corpus size is about 2 million documents, with 0.5 million unique words. I'm running it on an EC2 r4.4xlarge. When I lower the number of tags (to 100) it seems to work fine, and after the last info message it usually starts up in 10 minutes, but if I don't, it just gets stuck (I waited more than two hours).
What's happening after the last info message is displayed? Is there any way of debugging it further? I tried using the Python debugger and it seems to be stuck at
self.expElogbeta = np.exp(dirichlet_expectation(self.state.sstats))
at line 329 of ldamodel.py, but it might have been something else. Also, when I terminated the script after two hours, I got at least a thousand messages like: