Closed SilentCC closed 5 years ago
Any answer?
root@ml-server:/data# mpiexec -f machinefile lightlda -num_vocabs 1300000 -num_topics 10000 -num_iterations 100 -alpha 0.005 -beta 0.01 -mh_steps 2 -num_local_workers 11 -num_blocks 1 -max_num_document 1400000 -input_dir /data -data_capactiy 7000
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 274 MB, Alias capacity: 507 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 292 MB, Alias capacity: 507 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 319 MB, Alias capacity: 511 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 330 MB, Alias capacity: 512 MB, Delta capacity: 250MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 337 MB, Alias capacity: 512 MB, Delta capacity: 242MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 344 MB, Alias capacity: 512 MB, Delta capacity: 239MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 350 MB, Alias capacity: 512 MB, Delta capacity: 228MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 345 MB, Alias capacity: 512 MB, Delta capacity: 229MB
[INFO] [2018-11-06 15:32:48] INFO: block = 0, the number of slice = 9
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 275 MB, Alias capacity: 508 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 293 MB, Alias capacity: 508 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 320 MB, Alias capacity: 512 MB, Delta capacity: 255MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 330 MB, Alias capacity: 512 MB, Delta capacity: 248MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 337 MB, Alias capacity: 512 MB, Delta capacity: 242MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 344 MB, Alias capacity: 512 MB, Delta capacity: 239MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 350 MB, Alias capacity: 512 MB, Delta capacity: 227MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 345 MB, Alias capacity: 512 MB, Delta capacity: 229MB
[INFO] [2018-11-06 15:32:48] INFO: block = 0, the number of slice = 9
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 274 MB, Alias capacity: 507 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 293 MB, Alias capacity: 507 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 320 MB, Alias capacity: 511 MB, Delta capacity: 256MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 330 MB, Alias capacity: 512 MB, Delta capacity: 250MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 337 MB, Alias capacity: 512 MB, Delta capacity: 242MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 344 MB, Alias capacity: 512 MB, Delta capacity: 238MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 350 MB, Alias capacity: 512 MB, Delta capacity: 227MB
[INFO] [2018-11-06 15:32:48] Actual Model capacity: 345 MB, Alias capacity: 512 MB, Delta capacity: 229MB
[INFO] [2018-11-06 15:32:48] INFO: block = 0, the number of slice = 9
[INFO] [2018-11-06 15:32:48] Server 0 starts: num_workers=3 endpoint=inproc://server
[INFO] [2018-11-06 15:32:48] Server 2 starts: num_workers=3 endpoint=inproc://server
[INFO] [2018-11-06 15:32:48] Server 1 starts: num_workers=3 endpoint=inproc://server
[INFO] [2018-11-06 15:32:48] Server 0: Worker registratrion completed: workers=3 trainers=33 servers=3
[INFO] [2018-11-06 15:32:48] Rank 1/3: Multiverso initialized successfully.
[INFO] [2018-11-06 15:32:48] Rank 0/3: Multiverso initialized successfully.
[INFO] [2018-11-06 15:32:48] Rank 2/3: Multiverso initialized successfully.
[FATAL] [2018-11-06 15:32:48] Rank 1: corpus_size_ > memory_block_size when reading file /data/block.0
[FATAL] [2018-11-06 15:32:48] Rank 2: corpus_size_ > memory_block_size when reading file /data/block.0
[FATAL] [2018-11-06 15:32:48] Rank 0: corpus_size_ > memory_block_size when reading file /data/block.0
[INFO] [2018-11-06 15:32:49] Rank 2/3: Begin of configuration and initialization.
[INFO] [2018-11-06 15:32:49] Rank 0/3: Begin of configuration and initialization.
[INFO] [2018-11-06 15:32:49] Rank 1/3: Begin of configuration and initialization.
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 11
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@ml-server] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:0@ml-server] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ml-server] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@ml-server002] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:2@ml-server002] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@ml-server002] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ml-server] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@ml-server] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@ml-server] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@ml-server] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
This is because the value of data_capacity is too small, so the block file does not fit into the data buffer.
See the code in data_block.cpp:
memory_block_size_ = Config::data_capacity / sizeof(int32_t);
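To make the arithmetic behind that line concrete, here is a minimal sketch of the size check that the `corpus_size_ > memory_block_size` error in the log suggests. The helper names (`MemoryBlockSize`, `BlockFits`, `MinDataCapacityBytes`) are hypothetical and not part of LightLDA, and the sketch assumes the capacity is expressed in bytes at the point of the division; how the command-line value is scaled before it reaches `Config::data_capacity` is not shown in this thread.

```cpp
#include <cstdint>

// Hypothetical helper mirroring the quoted line from data_block.cpp.
// Assumption: data_capacity_bytes is the data buffer budget in bytes.
constexpr int64_t MemoryBlockSize(int64_t data_capacity_bytes) {
    // The buffer is counted in int32_t entries, so divide by 4 bytes.
    return data_capacity_bytes / static_cast<int64_t>(sizeof(int32_t));
}

// True when a block file holding corpus_size int32_t entries fits the buffer.
// The FATAL message in the log corresponds to this returning false.
constexpr bool BlockFits(int64_t corpus_size, int64_t data_capacity_bytes) {
    return corpus_size <= MemoryBlockSize(data_capacity_bytes);
}

// Smallest byte budget that can hold corpus_size int32_t entries.
constexpr int64_t MinDataCapacityBytes(int64_t corpus_size) {
    return corpus_size * static_cast<int64_t>(sizeof(int32_t));
}
```

Under these assumptions, the capacity has to be at least four times the number of int32_t entries in the largest block file, so raising data_capacity (or splitting the corpus into more blocks) is the way to get past the check.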