microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Core dump when reading data blocks #12

Closed yangs16 closed 8 years ago

yangs16 commented 8 years ago

I can run the nytimes example successfully. But on my own dataset, it failed with the following messages:

[INFO] [2015-11-20 16:26:13] INFO: block = 0, the number of slice = 1
[INFO] [2015-11-20 16:26:14] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2015-11-20 16:26:14] Server 0: Worker registratrion completed: workers=1 trainers=4 servers=1
[INFO] [2015-11-20 16:26:14] Rank 0/1: Multiverso initialized successfully.
[INFO] [2015-11-20 16:26:14] Rank 0/1: Begin of configuration and initialization.
foot.sh: line 13: 26600 Segmentation fault (core dumped) $bin/lightlda -num_vocabs 99948 -num_topics 50 -num_iterations 50 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 4 -num_blocks 1 -max_num_document 382578 -input_dir $dir -data_capacity 800

The program exited while processing the docs in the data blocks.

Any thoughts? Thanks a lot.

feiga commented 8 years ago

@yangs16

The data is in binary format. The error seems to be caused by an array index going out of range. Can you check your data preprocessing? Usually this is caused by an invalid data format.
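As one way to catch this before the binary conversion step, here is a minimal sketch that scans the intermediate input for out-of-range word ids or non-positive counts. It assumes a libsvm-style layout (one document per line, a doc label/id first, then `word_id:count` pairs), as in the nytimes example; the script name and the exact layout are assumptions, not part of the LightLDA tooling, so adjust the parsing to match your own preprocessing.

```python
#!/usr/bin/env python3
"""Sketch: sanity-check a libsvm-style LightLDA input file.

Assumed (not confirmed) format per line:
    doc_label word_id:count word_id:count ...
"""
import sys

def check_libsvm(path, num_vocabs):
    """Report tokens whose word id falls outside [0, num_vocabs)
    or whose count is not a positive integer."""
    bad = 0
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            for tok in line.split()[1:]:   # first token is the doc label/id
                try:
                    word_id, count = (int(x) for x in tok.split(":", 1))
                    ok = 0 <= word_id < num_vocabs and count > 0
                except ValueError:         # missing ':' or non-numeric field
                    ok = False
                if not ok:
                    bad += 1
                    print(f"line {lineno}: suspicious token {tok!r}")
    print(f"{bad} suspicious tokens found")

if __name__ == "__main__":
    # usage: python check_libsvm.py corpus.libsvm 99948
    check_libsvm(sys.argv[1], int(sys.argv[2]))
```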

yangs16 commented 8 years ago

Thanks Fei. The problem is solved: there were some terms with 0 TF (term frequency) in the word_id.dict file.
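For reference, a filtering sketch along these lines could drop the zero-frequency entries before training. It assumes (this is not confirmed in the thread) that each line of word_id.dict looks like `word_id<TAB>word<TAB>term_frequency`; adapt the column index if your file differs.

```python
#!/usr/bin/env python3
"""Sketch: drop zero-TF terms from a word_id.dict-style file.

Assumed (not confirmed) format per line:
    word_id<TAB>word<TAB>term_frequency
"""
import sys

def filter_dict(src, dst):
    kept = dropped = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 3 and int(fields[2]) == 0:
                dropped += 1        # zero TF: skip the term entirely
                continue
            fout.write(line)
            kept += 1
    print(f"kept {kept} terms, dropped {dropped} zero-TF terms")

if __name__ == "__main__":
    # usage: python filter_dict.py word_id.dict word_id.filtered.dict
    filter_dict(sys.argv[1], sys.argv[2])
```

Note that simply dropping lines may leave gaps in the word-id space; depending on how the dictionary is consumed, the ids may also need to be remapped and -num_vocabs adjusted after filtering.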

feiga commented 8 years ago

@yangs16 Thanks! I will add this boundary check to avoid such unexpected crashes.