microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License
842 stars, 234 forks

Error occurs in Nemesis Network Module #63

Open 1234clam opened 6 years ago

1234clam commented 6 years ago
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 591: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
terminate called recursively
terminate called after throwing an instance of 'zmq::error_t'
  what():  Context was terminated
[DEBUG] [2017-11-22 04:38:54] Request params. start = 1932613, end = 6789061

When I train LightLDA on a cluster with five workers, it throws an instance of zmq::error_t several times, but the project does not provide enough information about this error. Thank you! @feiga

rockyzhengwu commented 6 years ago

I have the same problem

[INFO] [2018-06-13 16:01:31] Rank = 0, Evaluation Time used: 61.05 s
[DEBUG] [2018-06-13 16:01:31] Request params. start = 0, end = 108602
[INFO] [2018-06-13 16:01:32] Rank = 0, Iter = 2926, Block = 0, Slice = 0
[INFO] [2018-06-13 16:01:32] Rank = 0, Alias Time used: 0.78 s
[INFO] [2018-06-13 16:01:34] Rank = 0, Training Time used: 33.29 s
[INFO] [2018-06-13 16:01:34] Rank = 0, sampling throughput: 21112.816283 (tokens/thread/sec)
[DEBUG] [2018-06-13 16:01:34] Request params. start = 0, end = 108602
[INFO] [2018-06-13 16:01:35] Rank = 0, Iter = 2927, Block = 0, Slice = 0
[INFO] [2018-06-13 16:01:36] Rank = 0, Alias Time used: 0.87 s
[INFO] [2018-06-13 16:01:37] Rank = 0, Training Time used: 32.86 s
[INFO] [2018-06-13 16:01:37] Rank = 0, sampling throughput: 21386.143085 (tokens/thread/sec)
[DEBUG] [2018-06-13 16:01:38] Request params. start = 0, end = 108602
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 591: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
terminate called after throwing an instance of 'zmq::error_t'
  what():  Context was terminated
./train_lightlda.sh: line 9: 62468 Aborted                 (core dumped)

xzyin commented 6 years ago
internal ABORT - process 0 terminate called after throwing an instance of 'zmq::error_t'

The problem occurs on process 0. When you start a training job, each worker gets a different process ID, and process 0 is the first machine in the machine list.

The cause of this problem may be that the first machine in your machine list is running out of memory.
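
A rough way to check the memory hypothesis is sketched below. It is not part of LightLDA: the helper names `first_host`, `parse_meminfo`, and `available_fraction` are made up for illustration, and it assumes Linux hosts plus a plain-text machine file with one host per line (optionally written as `host:nprocs`). Whether the first listed host really ends up as process 0 can also depend on how the job is launched, so treat that mapping as an assumption too.

```python
import os


def first_host(machine_file_text):
    """Return the first host named in a machine file, or None if it is empty.

    Assumption: one host per line, '#' starts a comment, and an entry may
    carry a process count as 'host:nprocs'.
    """
    for raw in machine_file_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if line:
            # Drop the optional ':nprocs' suffix and any trailing options.
            return line.split(":", 1)[0].split()[0]
    return None


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number}.

    Memory fields such as MemTotal and MemAvailable are reported in kB.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_fraction(info):
    """Return available/total memory, or None if the fields are missing."""
    total = info.get("MemTotal")
    # Older kernels lack MemAvailable; fall back to MemFree as a rough proxy.
    avail = info.get("MemAvailable", info.get("MemFree"))
    if not total or avail is None:
        return None
    return avail / total


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    # Run this on the first machine while training is in progress.
    with open("/proc/meminfo") as f:
        frac = available_fraction(parse_meminfo(f.read()))
    if frac is not None:
        print("available memory: {:.1%}".format(frac))
```

If the available fraction on that machine drops close to zero shortly before the abort, memory pressure is a plausible cause; if it stays comfortably high, the cause is more likely elsewhere.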

rockyzhengwu commented 6 years ago

@xzyin I'm sure there is enough memory. I don't think this error is reproducible, because I have never encountered it again.