mimno / Mallet

MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
https://mimno.github.io/Mallet/

Parallel problem in MALLET LDA (gensim wrapper) #176

Open thisray opened 4 years ago

thisray commented 4 years ago

Hi,

I use the gensim wrapper, LdaMallet(), to run MALLET.

The gensim library provides a `workers` parameter that sets the --num-threads argument in MALLET.
(Ref: Gensim Code - line274)

But `workers` does not seem to help. Here are the run times for different settings:

 `workers=1` -> run time: 7.32 sec   # <--
 `workers=2` -> run time: 2min 25s
 `workers=4` -> run time: 2min 38s
 `workers=16` -> run time: 3min 13s  # <--

Whether I run this on my computer:

openjdk version "1.8.0_162"
OpenJDK Runtime Environment (build 1.8.0_162-8u162-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.162-b12, mixed mode)

or on Colab:

openjdk version "11.0.4" 2019-07-16
OpenJDK Runtime Environment (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3)
OpenJDK 64-Bit Server VM (build 11.0.4+11-post-Ubuntu-1ubuntu218.04.3, mixed mode, sharing)

the results are similar: more workers take more time. (I have also tried mallet-2.0.8 and mallet-2.0.7.)

Does this mean I am not running MALLET LDA in parallel the proper way?

Thanks!


reference code:

# code in gensim (python)
# (i tried with different `workers`)

workers = 16
gensim.models.wrappers.LdaMallet(mallet_path, corpus=corpus, num_topics=num_topics, id2word=id2word, 
                                 optimize_interval=1, iterations=6000, workers=workers)
# the equivalent command in MALLET (typed in a shell, I/O options omitted):

$ bin/mallet train-topics --num-threads 16

patelamalk commented 4 years ago

I have the same problem: for 12077 files (~5 GB) it takes 4 hrs. It doesn't seem to be utilizing all the cores.

mimno commented 4 years ago

Unless this can be replicated in the java-only version there's not much to do here -- I'd check with gensim.
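
One way to attempt that Java-only replication is to time `bin/mallet train-topics` directly, without gensim, at several thread counts. The sketch below is only an illustration under assumptions: the helper names, the `corpus.mallet` input file and the topic count in the usage note are hypothetical and do not come from this thread. The command-line flags are MALLET's own `train-topics` options, with iteration and optimization settings mirroring the gensim call above.

```python
import subprocess
import time


def build_mallet_command(mallet_bin, input_file, num_topics, num_threads,
                         iterations=6000, optimize_interval=1):
    """Build a `mallet train-topics` command as a list of arguments.

    `input_file` is assumed to be a corpus already imported into
    MALLET's binary format (for example with `mallet import-file`).
    """
    return [
        mallet_bin, "train-topics",
        "--input", input_file,
        "--num-topics", str(num_topics),
        "--num-threads", str(num_threads),
        "--num-iterations", str(iterations),
        "--optimize-interval", str(optimize_interval),
    ]


def time_command(command):
    """Run the command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    # MALLET's progress output is discarded so it does not clutter the timings.
    subprocess.run(command, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.perf_counter() - start


def compare_thread_counts(mallet_bin, input_file, num_topics, thread_counts):
    """Return a dict mapping each thread count to its run time in seconds."""
    return {
        n: time_command(build_mallet_command(mallet_bin, input_file, num_topics, n))
        for n in thread_counts
    }
```

Calling something like `compare_thread_counts("bin/mallet", "corpus.mallet", 20, [1, 2, 4, 16])` from the MALLET directory would give timings comparable to the table in the first post. If the slowdown shows up here as well, it is not specific to the gensim wrapper.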

d0nghyunkang commented 3 years ago

@thisray This thread has been dormant for a while, but have you checked how many cores/threads your computer has? It could be that it has fewer than 16 cores/threads, so requesting 16 slows you down.
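
A quick way to check this from Python is `os.cpu_count()`, which reports the number of logical CPUs. The helper below is a hypothetical sketch, not part of gensim or MALLET: it caps a requested worker count at what the machine reports before the value is passed on as `workers`.

```python
import os


def clamp_workers(requested, available=None):
    """Return a worker count no larger than the number of logical CPUs."""
    if available is None:
        # os.cpu_count() can return None when the count cannot be determined.
        available = os.cpu_count() or 1
    # Never go below one worker, never above what is available.
    return max(1, min(requested, available))


print("logical CPUs:", os.cpu_count())
print("workers to use:", clamp_workers(16))
```

This only rules out oversubscription. It would not by itself explain why `workers=2` is already much slower than `workers=1` in the timings above.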