microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Invalid topic assignment N from word proposal, something different from the former issues #38

Closed sdy1106 closed 7 years ago

sdy1106 commented 7 years ago

I have the same "Invalid topic assignment N from word proposal" problem as the former issues. The former issues all used their own corpora, where the word TF counts were wrong, but I am using the nytimes example corpus. When I run on a single machine, it works well. When I run on two or four machines with MPI, it fails with "Invalid topic assignment N from word proposal". If you can tell me why, I will appreciate it very much.

The following is the command I ran:

OMP_NUM_THREADS=1 mpirun -n 2 -perhost 1 -host node1,node2 $bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir $dir -data_capacity 800

The following is the logging info:

/home/danyang/mfs/lightLDA/example
[INFO] [2016-09-29 12:49:01] INFO: block = 0, the number of slice = 1
[INFO] [2016-09-29 12:49:01] INFO: block = 0, the number of slice = 1
[INFO] [2016-09-29 12:49:01] Server 0 starts: num_workers=2 endpoint=inproc://server
[INFO] [2016-09-29 12:49:01] Server 1 starts: num_workers=2 endpoint=inproc://server
[INFO] [2016-09-29 12:49:01] Server 0: Worker registratrion completed: workers=2 trainers=2 servers=2
[INFO] [2016-09-29 12:49:01] Rank 0/2: Multiverso initialized successfully.
[INFO] [2016-09-29 12:49:01] Rank 1/2: Multiverso initialized successfully.
[INFO] [2016-09-29 12:49:09] Rank 1/2: Begin of configuration and initialization.
[INFO] [2016-09-29 12:49:11] Rank 0/2: Begin of configuration and initialization.
[INFO] [2016-09-29 12:49:34] Rank 0/2: End of configration and initialization.
[INFO] [2016-09-29 12:49:34] Rank 1/2: End of configration and initialization.
[INFO] [2016-09-29 12:49:34] Rank 0/2: Begin of training.
[INFO] [2016-09-29 12:49:34] Rank 1/2: Begin of training.
[DEBUG] [2016-09-29 12:49:34] Request params. start = 0, end = 101635
[DEBUG] [2016-09-29 12:49:34] Request params. start = 0, end = 101635
[INFO] [2016-09-29 12:49:37] Rank = 0, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-09-29 12:49:38] Rank = 1, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-09-29 12:49:40] Rank = 0, Alias Time used: 6.88 s
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 13989076 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 25504855 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 341954549 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 510688998 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 1604518467 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 72048999 from word proposal
[INFO] [2016-09-29 12:49:40] Rank = 1, Alias Time used: 7.31 s
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 13989076 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 25504855 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 341954549 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 510688998 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 1604518467 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 72048999 from word proposal
[FATAL] [2016-09-29 12:49:40] Invalid topic assignment 324590367 from word proposal

If my way of running it on multiple machines is wrong, please show me a correct example. Thanks very much!

sdy1106 commented 7 years ago

Any answer?

feiga commented 7 years ago

@sdy1106 Sorry for the late reply. I suppose you are running two machines, each of which loads the whole dataset. The effective training data is then twice the original, which doubles the word TF counts as well, and the wrong TF causes the unexpected error. If you want to run two machines in a data-parallel way, you need to partition the dataset first.
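
A rough sketch of the partitioning step (file names are illustrative, and the dump_binary argument order here follows the nytimes example script, so please verify it against your build):

    # split the libsvm file into two halves without breaking lines
    split -n l/2 nytimes.libsvm part.                       # produces part.aa, part.ab
    # convert each half to LightLDA's binary format, one block id per machine
    $bin/dump_binary part.aa nytimes.word_id.dict $dir 0    # -> block.0 / vocab.0
    $bin/dump_binary part.ab nytimes.word_id.dict $dir 1    # -> block.1 / vocab.1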

sdy1106 commented 7 years ago

@feiga Thank you for your reply. I am sure I have done the data partitioning: I split the libsvm file into two parts and generated block.0 and block.1 with dump_binary successfully. When I run it, I expect block.0 to be handled by rank 0 and block.1 by rank 1, but within one iteration both rank 0 and rank 1 process both block.0 and block.1, which I think is unreasonable.

This is my command:

mpirun -np 2 $bin/lightlda -num_servers 2 -num_vocabs 111400 -num_topics 1000 -num_iterations 10 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 2 -num_blocks 2 -max_num_document 1000000 -input_dir $dir -data_capacity 800

Something wrong?

feiga commented 7 years ago

This is an MPI problem. You are running two processes on one machine, and both processes run with the same command-line arguments; that is why each of them processes both blocks. Also note that it is not necessary to run multiple processes on one machine, since LightLDA is already optimized for multi-threading. Just use one process per machine.

If you would like to try distributed training, you need to partition the data and distribute the partitions to different machines, then run mpirun -machinefile machinelist.txt lightlda ... -num_blocks num_of_local_blocks (see the sketch below).
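
A minimal sketch of that launch (hostnames are illustrative, and it assumes each machine already holds its own block and vocab files on local disk):

    # machinelist.txt lists one host per line:
    #   node1
    #   node2
    mpirun -machinefile machinelist.txt $bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir $dir -data_capacity 800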

I think this would solve your problems.

sdy1106 commented 7 years ago

@feiga I've tried your solution and it also failed. I used -machinefile and put two IPs in it. I am a little confused about "partition the data and distribute the data to different machines". I did the data partitioning on a shared file system, and I am sure every node can access it. Should I do something else? How do I make sure that a specific node accesses a specific block and vocab file?

Debugging log below:

[INFO] [2016-11-02 16:06:29] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:06:29] INFO: block = 1, the number of slice = 1
before multiverso init!
Multiverso init!
[INFO] [2016-11-02 16:06:30] INFO: block = 0, the number of slice = 1
[INFO] [2016-11-02 16:06:30] INFO: block = 1, the number of slice = 1
before multiverso init!
Multiverso init!
[INFO] [2016-11-02 16:06:30] Server 1 starts: num_workers=2 endpoint=inproc://server
[INFO] [2016-11-02 16:06:30] Server 0 starts: num_workers=2 endpoint=inproc://server
MPI !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
MPI !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[INFO] [2016-11-02 16:06:30] Server 0: Worker registratrion completed: workers=2 trainers=4 servers=2
[INFO] [2016-11-02 16:06:30] Rank 1/2: Multiverso initialized successfully.
after multiverso init!
[INFO] [2016-11-02 16:06:30] Rank 0/2: Multiverso initialized successfully.
after multiverso init!
[INFO] [2016-11-02 16:06:31] Rank 1/2: Begin of configuration and initialization.
[INFO] [2016-11-02 16:06:32] Rank 0/2: Begin of configuration and initialization.
[INFO] [2016-11-02 16:07:04] Rank 1/2: End of configration and initialization.
[INFO] [2016-11-02 16:07:04] Rank 1/2: Begin of training.
[INFO] [2016-11-02 16:07:04] Rank 0/2: End of configration and initialization.
[INFO] [2016-11-02 16:07:04] Rank 0/2: Begin of training.
[INFO] [2016-11-02 16:07:06] Rank = 1, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-11-02 16:07:08] Rank = 1, Alias Time used: 4.65 s
[FATAL] [2016-11-02 16:07:08] Invalid topic assignment 519631042 from word proposal
[FATAL] [2016-11-02 16:07:08] Invalid topic assignment 1858367906 from word proposal
[FATAL] [2016-11-02 16:07:08] Invalid topic assignment 2140424801 from word proposal

feiga commented 7 years ago

Don't use a shared file system; use the local disk instead. Each machine stores one block, and on each machine that block is named block.0. Set -num_blocks to 1.
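
For example, a sketch of the copy step (hostnames and paths are illustrative, and it assumes dump_binary produced block.0/vocab.0 and block.1/vocab.1 as above):

    # machine 1 gets the first partition on its local disk
    scp block.0 vocab.0 node1:/local/lightlda/data/
    # machine 2 gets the second partition, renamed so it is also block.0/vocab.0
    scp block.1 node2:/local/lightlda/data/block.0
    scp vocab.1 node2:/local/lightlda/data/vocab.0

Then point -input_dir at /local/lightlda/data on each machine and run with -num_blocks 1.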

sdy1106 commented 7 years ago

OK! It works! Thanks!

qinghua2016 commented 7 years ago

Hi, did you install mpicc on both computers? What is the machine file about? Does each machine have the same directory to store the training files? Could you tell me the details of how you succeeded in running distributed training? @sdy1106