Open ericxsun opened 5 years ago
I started the training routine on a single machine with 256 GB of memory and 40 CPU cores. After 23 iterations, the process was killed.
[INFO] [2019-04-18 15:15:54] Rank = 0, Training Time used: 824.51 s
[INFO] [2019-04-18 15:15:54] Rank = 0, sampling throughput: 675.029212 (tokens/thread/sec)
[INFO] [2019-04-18 15:15:55] Rank = 0, Iter = 23, Block = 0, Slice = 94
[DEBUG] [2019-04-18 15:15:55] Request params. start = 636142, end = 958620
[INFO] [2019-04-18 15:15:55] Rank = 0, Alias Time used: 1.36 s
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 591: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
terminate called after throwing an instance of 'zmq::error_t'
  what(): Context was terminated
My configuration (here "w" means 10,000):
num_vocabs: 833w
num_topics: 10w
num_iterations=100
alpha=0.0005
beta=0.01
mh_steps=2
num_local_workers=30
num_blocks=1
max_num_document=13916w (block size 71G in lightLDA format)
data_capacity=73G
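For scale, here is a rough back-of-the-envelope sketch of what these settings imply. It reads "w" as 10,000, and it assumes a fully dense word-topic table of 4-byte counts. That is only an upper bound for illustration, not a statement about how LightLDA actually stores the model. The helper names `expand` and `dense_table_bytes` are hypothetical and are not part of LightLDA.

```python
W = 10_000  # assumption: "w" is shorthand for ten thousand


def expand(value: str) -> int:
    """Expand a count such as '833w' into a plain integer."""
    if value.endswith("w"):
        return int(value[:-1]) * W
    return int(value)


def dense_table_bytes(num_vocabs: int, num_topics: int, bytes_per_count: int = 4) -> int:
    """Size of a hypothetical dense vocab x topic count table, in bytes."""
    return num_vocabs * num_topics * bytes_per_count


if __name__ == "__main__":
    num_vocabs = expand("833w")
    num_topics = expand("10w")
    max_num_document = expand("13916w")

    dense = dense_table_bytes(num_vocabs, num_topics)
    print(f"num_vocabs       = {num_vocabs:,}")
    print(f"num_topics       = {num_topics:,}")
    print(f"max_num_document = {max_num_document:,}")
    # Upper bound only: a real model table is likely far sparser than this.
    print(f"dense word-topic table (upper bound) = {dense / 1e12:.2f} TB")
```

Under the dense assumption the table alone would be far larger than the 256 GB available, so how sparse the model stays in practice matters a great deal for whether the run fits in memory.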
So what is the problem? Could anyone help me? Thanks very much.