microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

How to use LightLDA in distributed mode? #40

weixliu closed this issue 7 years ago

weixliu commented 7 years ago

I am running the LightLDA example in distributed mode with the following command:

mpiexec -machinefile $root/machine.list $bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_servers 6 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir $dir -data_capacity 800

Compared with nytimes.sh, I only added the -machinefile and -num_servers parameters; all other parameters are the same. When I execute the command I get the error log below, and I don't know why. I copied the same data to the same location on all 6 machines.

[INFO] [2016-11-02 16:12:44] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:12:54] Server 2 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:54] Server 4 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:44] Server 0 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:54] Server 1 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:54] Server 3 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:54] Server 5 starts: num_workers=6 endpoint=inproc://server
[INFO] [2016-11-02 16:12:44] Server 0: Worker registratrion completed: workers=6 trainers=6 servers=6
[INFO] [2016-11-02 16:12:54] Rank 4/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:54] Rank 1/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:54] Rank 5/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:54] Rank 2/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:54] Rank 3/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:44] Rank 0/6: Multiverso initialized successfully.
[INFO] [2016-11-02 16:12:55] Rank 3/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:12:55] Rank 4/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:12:55] Rank 2/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:12:55] Rank 1/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:12:55] Rank 5/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:12:44] Rank 0/6: Begin of configuration and initialization.
[INFO] [2016-11-02 16:13:13] Rank 3/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:02] Rank 0/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:13] Rank 2/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:13] Rank 1/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:13] Rank 5/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:13] Rank 2/6: Begin of training.
[INFO] [2016-11-02 16:13:13] Rank 3/6: Begin of training.
[INFO] [2016-11-02 16:13:13] Rank 4/6: End of configration and initialization.
[INFO] [2016-11-02 16:13:02] Rank 0/6: Begin of training.
[INFO] [2016-11-02 16:13:13] Rank 1/6: Begin of training.
[INFO] [2016-11-02 16:13:13] Rank 5/6: Begin of training.
[INFO] [2016-11-02 16:13:13] Rank 4/6: Begin of training.
[DEBUG] [2016-11-02 16:13:02] Request params. start = 0, end = 101635
[DEBUG] [2016-11-02 16:13:13] Request params. start = 0, end = 101635
[DEBUG] [2016-11-02 16:13:13] Request params. start = 0, end = 101635
[DEBUG] [2016-11-02 16:13:13] Request params. start = 0, end = 101635
[DEBUG] [2016-11-02 16:13:13] Request params. start = 0, end = 101635
[DEBUG] [2016-11-02 16:13:13] Request params. start = 0, end = 101635
[INFO] [2016-11-02 16:13:16] Rank = 2, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:07] Rank = 0, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:17] Rank = 2, Alias Time used: 5.49 s 
[FATAL] [2016-11-02 16:13:17] Invalid topic assignment 148893078 from word proposal
[FATAL] [2016-11-02 16:13:17] Invalid topic assignment 1228263461 from word proposal
[FATAL] [2016-11-02 16:13:17] Invalid topic assignment 8397506 from word proposal
...
[INFO] [2016-11-02 16:13:18] Rank = 4, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:08] Rank = 0, Alias Time used: 5.51 s 
[INFO] [2016-11-02 16:13:19] Rank = 1, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:19] Rank = 5, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:19] Rank = 3, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:13:20] Rank = 4, Alias Time used: 6.07 s 
[FATAL] [2016-11-02 16:13:20] Invalid topic assignment 148893078 from word proposal
...
[FATAL] [2016-11-02 16:13:21] Invalid topic assignment 8397506 from word proposal

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@tttt05] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@tttt05] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@tttt05] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@tttt06] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@tttt06] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@tttt06] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@tttt05] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@tttt05] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@tttt05] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@tttt05] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

However, nytimes.sh runs fine on a single machine, so I would like to know how to use LightLDA in distributed mode. Thanks!

weixliu commented 7 years ago

Should I split the data into six parts, run dump_binary on each part to generate six separate sets of files, and put each set on a different machine? Or should I just put the same data on every machine?

sdy1106 commented 7 years ago

See the issue below!

weixliu commented 7 years ago

Yes, I see the same error there [Invalid topic assignment 148893078 from word proposal], but I just want to know how to set up the data in distributed mode. If you know anything about it, I would appreciate any information.

sdy1106 commented 7 years ago

Split the libsvm data into blocks and run dump_binary on each block to generate block.0, vocab.0, and vocab.0.txt, then copy each block's files to its own node. Please see #38 for more information.
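The splitting step could be sketched roughly as follows. This is only a minimal illustration, assuming one document per line in the libsvm file and a simple round-robin assignment of documents to shards; the thread does not say how the split should be done. The function name split_libsvm and the output names part.N.libsvm are hypothetical and not part of LightLDA.

```python
import os


def split_libsvm(input_path, output_dir, num_parts):
    """Split a libsvm file into num_parts shards, one document (line) at a time.

    Documents are assigned round-robin so the shards end up roughly equal in
    size. The shard names (part.N.libsvm) are hypothetical, not a LightLDA
    convention.
    """
    os.makedirs(output_dir, exist_ok=True)
    paths = [
        os.path.join(output_dir, "part.%d.libsvm" % i) for i in range(num_parts)
    ]
    outputs = [open(p, "w", encoding="utf-8") for p in paths]
    counts = [0] * num_parts
    try:
        with open(input_path, encoding="utf-8") as src:
            doc_index = 0
            for line in src:
                if not line.strip():
                    continue  # skip blank lines, they are not documents
                if not line.endswith("\n"):
                    line += "\n"  # make sure the last document is terminated
                part = doc_index % num_parts
                outputs[part].write(line)
                counts[part] += 1
                doc_index += 1
    finally:
        for f in outputs:
            f.close()
    return paths, counts
```

For six machines you would call split_libsvm with num_parts=6, then run dump_binary on each shard as described above and copy each shard's output to one node.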

weixliu commented 7 years ago

Thanks for the information, it's very useful. I need to split the data across the machines first.