microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

terminate called after throwing an instance of 'zmq::error_t' #79

Open ericxsun opened 5 years ago

ericxsun commented 5 years ago

I started the training routine on a single machine with 256 GB of memory and 40 CPU cores. After 23 iterations, the process was killed.

[INFO] [2019-04-18 15:15:54] Rank = 0, Training Time used: 824.51 s
[INFO] [2019-04-18 15:15:54] Rank = 0, sampling throughput: 675.029212 (tokens/thread/sec)
[INFO] [2019-04-18 15:15:55] Rank = 0, Iter = 23, Block = 0, Slice = 94
[DEBUG] [2019-04-18 15:15:55] Request params. start = 636142, end = 958620
[INFO] [2019-04-18 15:15:55] Rank = 0, Alias Time used: 1.36 s
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 591: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
terminate called after throwing an instance of 'zmq::error_t'
  what():  Context was terminated

My configuration:

num_vocabs=8330000 (833w)
num_topics=100000 (10w)
num_iterations=100
alpha=0.0005
beta=0.01
mh_steps=2
num_local_workers=30
num_blocks=1
max_num_document=139160000 (13916w; block size 71G in LightLDA format)
data_capacity=73G
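
For a rough sense of scale, the numbers above can be expanded with a few lines of Python. This is only a sketch under stated assumptions: the `w` suffix is taken to mean 10,000, counts are assumed to be 32-bit integers, and "G" is treated as GiB. The helper `expand_w` is hypothetical and is not part of LightLDA.

```python
# Hypothetical helper: expand the "w" shorthand (assumed to mean 10,000).
def expand_w(value: str) -> int:
    if value.endswith("w"):
        return int(value[:-1]) * 10_000
    return int(value)


num_vocabs = expand_w("833w")        # 8,330,000 words
num_topics = expand_w("10w")         # 100,000 topics
num_documents = expand_w("13916w")   # 139,160,000 documents

# Upper bound only: size of a fully dense word-topic count table.
BYTES_PER_COUNT = 4  # assumption: 32-bit integer counts
dense_table_bytes = num_vocabs * num_topics * BYTES_PER_COUNT
dense_table_tib = dense_table_bytes / 2**40

# Memory left on the machine once the single data block is loaded.
ram_gib = 256          # from the machine description
data_block_gib = 71    # block size reported in the configuration
remaining_gib = ram_gib - data_block_gib

print(f"dense word-topic table (upper bound): {dense_table_tib:.2f} TiB")
print(f"memory left after the data block: {remaining_gib} GiB")
```

The dense figure is only an upper bound. How much memory the model really takes depends on how sparse the word-topic counts are, which this sketch does not estimate, so it cannot by itself confirm or rule out memory exhaustion as the cause of the abort.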

So what is the problem? Could anyone help me? Thanks very much.