microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License
842 stars 234 forks

Distributed running nytimes through mpi #69

Open Abigale001 opened 6 years ago

Abigale001 commented 6 years ago

After using dump_library to split the libsvm file into 2 parts, I sent block.1, vocab.1, cocab.1.txt, and vocab.nytimes.txt.1 to the second node. Then I ran this command on the first node:

mpiexec -machinefile mpi_machine_file ../bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir ./data/nytimes/ -data_capacity 800

My mpi_machine_file is:

10.107.14.100
10.107.14.70

Here 10.107.14.100 is the first node and 10.107.14.70 is the second node. But I don't think it runs correctly. Here is the log. Could anyone help?

[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. 
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. ... Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. 
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1 [INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server [INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1 [INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully. ... [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. [INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization. 
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. Rank 0/1: Begin of configuration and initialization. 
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization. [INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization. Rank 0/1: Begin of configuration and initialization. ... [INFO] [2018-04-02 03:13:14] Rank = 0, Iter = 1, Block = 0, Slice = 0 [INFO] [2018-04-02 03:13:17] Rank = 0, Alias Time used: 7.33 s [INFO] [2018-04-02 02:01:15] Rank = 0, Training Time used: 7593.99 s [INFO] [2018-04-02 02:01:16] Rank = 0, sampling throughput: 13075.037526 (tokens/thread/sec) [INFO] [2018-04-02 02:01:17] word likelihood : 5.329021e+08 [INFO] [2018-04-02 02:01:17] Normalized likelihood : -1.561648e+09 [INFO] [2018-04-02 02:01:17] Rank = 0, Evaluation Time used: 1100.00 s [DEBUG] [2018-04-02 02:01:17] Request params. start = 0, end = 101635 [INFO] [2018-04-02 02:01:46] Rank = 0, Training Time used: 7615.72 s [INFO] [2018-04-02 02:01:46] Rank = 0, sampling throughput: 13037.700213 (tokens/thread/sec) [INFO] [2018-04-02 02:01:49] word likelihood : 5.329021e+08 [INFO] [2018-04-02 02:01:49] Normalized likelihood : -1.561648e+09 [INFO] [2018-04-02 02:01:49] Rank = 0, Evaluation Time used: 969.25 s [DEBUG] [2018-04-02 02:01:49] Request params. 
start = 0, end = 101635 [INFO] [2018-04-02 02:02:03] Rank = 0, Training Time used: 7651.74 s [INFO] [2018-04-02 02:02:03] Rank = 0, sampling throughput: 12976.338971 (tokens/thread/sec) [INFO] [2018-04-02 03:14:18] doc likelihood : -6.422194e+08 [INFO] [2018-04-02 02:02:15] Rank = 0, Training Time used: 7662.92 s [INFO] [2018-04-02 02:02:15] Rank = 0, sampling throughput: 12957.428332 (tokens/thread/sec) [INFO] [2018-04-02 03:14:43] word likelihood : 5.329020e+08 [INFO] [2018-04-02 03:14:43] Normalized likelihood : -1.561649e+09 [INFO] [2018-04-02 03:14:43] Rank = 0, Evaluation Time used: 985.55 s [DEBUG] [2018-04-02 03:14:44] Request params. start = 0, end = 101635 [INFO] [2018-04-02 03:14:56] doc likelihood : -6.421577e+08 [INFO] [2018-04-02 03:15:26] Rank = 0, Training Time used: 8809.39 s [INFO] [2018-04-02 03:15:26] Rank = 0, sampling throughput: 11270.751379 (tokens/thread/sec) [INFO] [2018-04-02 03:15:45] word likelihood : 5.329053e+08 [INFO] [2018-04-02 03:15:45] Normalized likelihood : -1.561649e+09 [INFO] [2018-04-02 03:15:45] Rank = 0, Evaluation Time used: 1008.38 s [INFO] [2018-04-02 02:03:32] Rank = 0, Training Time used: 8703.96 s [INFO] [2018-04-02 02:03:32] Rank = 0, sampling throughput: 11407.660278 (tokens/thread/sec) [INFO] [2018-04-02 02:03:55] Rank = 0, Iter = 1, Block = 0, Slice = 0 [INFO] [2018-04-02 02:03:56] Rank = 0, Training Time used: 8760.07 s [INFO] [2018-04-02 02:03:56] Rank = 0, sampling throughput: 11334.575356 (tokens/thread/sec) [INFO] [2018-04-02 02:03:59] Rank = 0, Alias Time used: 5.45 s [INFO] [2018-04-02 02:04:07] doc likelihood : -6.422155e+08 [INFO] [2018-04-02 02:04:17] Rank = 0, Training Time used: 7770.12 s [INFO] [2018-04-02 02:04:17] Rank = 0, sampling throughput: 12778.687178 (tokens/thread/sec) [DEBUG] [2018-04-02 03:16:33] Request params. 
start = 0, end = 101635 [INFO] [2018-04-02 02:04:26] word likelihood : 5.328867e+08 [INFO] [2018-04-02 02:04:26] Normalized likelihood : -1.561648e+09 [INFO] [2018-04-02 02:04:26] Rank = 0, Evaluation Time used: 830.64 s [INFO] [2018-04-02 03:16:41] Rank = 0, Iter = 1, Block = 0, Slice = 0 [DEBUG] [2018-04-02 02:04:26] Request params. start = 0, end = 101635 [INFO] [2018-04-02 03:16:45] Rank = 0, Alias Time used: 7.17 s [INFO] [2018-04-02 03:16:59] Rank = 0, Training Time used: 8880.24 s [INFO] [2018-04-02 03:16:59] Rank = 0, sampling throughput: 11181.214254 (tokens/thread/sec) [INFO] [2018-04-02 02:05:02] doc likelihood : -6.421963e+08 [INFO] [2018-04-02 03:17:17] doc likelihood : -6.422194e+08 [INFO] [2018-04-02 02:05:09] Rank = 0, Training Time used: 7802.58 s [INFO] [2018-04-02 02:05:09] Rank = 0, sampling throughput: 12725.500550 (tokens/thread/sec) [INFO] [2018-04-02 02:05:17] doc likelihood : -6.421963e+08 [INFO] [2018-04-02 02:05:20] Rank = 0, Training Time used: 7819.40 s [INFO] [2018-04-02 02:05:20] [INFO] [2018-04-02 03:17:40] Rank = 0, Iter = 1, Block = 0, Slice = 0 [INFO] [2018-04-02 03:17:43] Rank = 0, Alias Time used: 5.78 s Rank = 0, sampling throughput: 12698.067672 (tokens/thread/sec) [INFO] [2018-04-02 02:05:30] word likelihood : 5.329021e+08 [INFO] [2018-04-02 02:05:30] Normalized likelihood : -1.561648e+09 [INFO] [2018-04-02 02:05:30] Rank = 0, Evaluation Time used: 515.49 s [INFO] [2018-04-02 02:05:33] Rank = 0, Training Time used: 8845.28 s [INFO] [2018-04-02 02:05:33] Rank = 0, sampling throughput: 11225.364458 (tokens/thread/sec) [DEBUG] [2018-04-02 02:05:45] Request params. 
start = 0, end = 101635 [INFO] [2018-04-02 03:18:07] word likelihood : 5.329020e+08 [INFO] [2018-04-02 03:18:07] Normalized likelihood : -1.561649e+09 [INFO] [2018-04-02 03:18:07] Rank = 0, Evaluation Time used: 1170.88 s [INFO] [2018-04-02 02:05:58] word likelihood : 5.329021e+08 [INFO] [2018-04-02 02:05:58] Normalized likelihood : -1.561648e+09 [INFO] [2018-04-02 02:05:58] Rank = 0, Evaluation Time used: 509.73 s ...

  1. The log is highly repetitive and never shows the other node: every line reports rank 0.
  2. An iteration takes several hours, much slower than running on a single node. I have checked resource usage, and both nodes have used 400G of memory.
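
The repetition described above can be checked mechanically. The following is a minimal sketch, assuming that `Rank r/n` in the Multiverso log means "rank r of n processes"; the helper names `summarize_ranks` and `looks_like_independent_runs` are hypothetical and not part of LightLDA. If the same (rank, size) pair is reported more than once, the processes were probably started as separate single-process jobs rather than as one MPI job.

```python
import re
from collections import Counter

# Matches lines such as "Rank 0/1: Multiverso initialized successfully."
# Assumption: "Rank r/n" means rank r out of n processes.
RANK_RE = re.compile(r"Rank (\d+)/(\d+): Multiverso initialized successfully")


def summarize_ranks(log_text):
    """Count how many processes reported each (rank, world_size) pair."""
    return Counter((int(r), int(n)) for r, n in RANK_RE.findall(log_text))


def looks_like_independent_runs(log_text):
    """True if any (rank, world_size) pair was reported more than once.

    In a single healthy MPI job, each rank is expected to appear once.
    """
    return any(count > 1 for count in summarize_ranks(log_text).values())


# Two lines taken from the log above, one from each timestamp group.
sample = (
    "[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully. "
    "[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully."
)

print(summarize_ranks(sample))
print(looks_like_independent_runs(sample))
```

On the log posted here, every process reports `Rank 0/1`, which is consistent with the launcher and the binary not sharing the same MPI.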

stsk129 commented 6 years ago

You are probably using an external MPI. LightLDA already installs mpich2 in ~/lightlda/multiverso/third_party/bin. Make sure you link against that one. Also specify the number of servers in your options.

Abigale001 commented 6 years ago

Oh yes, thank you. I had overlooked the empty multiverso directory.

Abigale001 commented 6 years ago

But after installing Multiverso in LightLDA, how can I tell which MPI I am using? Nothing seems to have changed.
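
One possible way to check, sketched here under assumptions rather than taken from the LightLDA docs: see which `mpiexec` the shell resolves, and which MPI shared libraries the `lightlda` binary is linked against according to `ldd`. The helper names `mpi_libraries` and `report` are hypothetical, and the binary path is the one from the command earlier in this thread.

```python
import shutil
import subprocess


def mpi_libraries(ldd_output):
    """Return resolved paths of shared libraries whose name contains 'mpi'."""
    paths = []
    for line in ldd_output.splitlines():
        name, sep, rest = line.strip().partition(" => ")
        if sep and "mpi" in name.lower():
            # Drop the trailing load address, e.g. " (0x00007f...)".
            paths.append(rest.split(" (")[0])
    return paths


def report(binary="../bin/lightlda"):
    """Print the mpiexec found on PATH and the MPI libraries the binary uses."""
    print("mpiexec on PATH:", shutil.which("mpiexec"))
    try:
        result = subprocess.run(
            ["ldd", binary], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not run ldd:", exc)
        return
    libs = mpi_libraries(result.stdout)
    print("MPI libraries linked:", libs or "none found (possibly statically linked)")


if __name__ == "__main__":
    report()
```

If the `mpiexec` path is not under ~/lightlda/multiverso/third_party/bin, or the linked MPI library lives elsewhere, the launcher and the binary are likely using different MPI installations.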