microsoft / LightLDA

Scalable, fast, and lightweight system for large-scale topic modeling
http://www.dmtk.io
MIT License

Invalid table or row ids: 0 -1Segmentation fault #20

Open ankitag2013 opened 8 years ago

ankitag2013 commented 8 years ago

I ran LightLDA over a corpus of 1000 documents with a vocabulary of approximately 22000 words. I used the text2libsvm script in lightlda/example to convert the UCI data to libsvm format and to generate the dictionary.

I am getting the following error:

bin/dump_binary test.libsvm test.dict . 0
There are totally 6729 words in the vocabulary
There are maximally totally 21405 tokens in the data set
The number of tokens in the output block is: 21405
Local vocab_size for the output block is: 6720
Elapsed seconds for dump blocks: 0.0125935
root@ankit:/home/ankit/lightlda# bin/lightlda -num_vocabs 22000 -num_topics 1000 -num_iterations 1 -alpha 0.1 -beta 0.01 -num_blocks 1 -max_num_document 1000 -input_dir /home/ankit/lightlda -data_capacity 800
[INFO] [2016-01-14 18:37:21] INFO: block = 0, the number of slice = 1
[INFO] [2016-01-14 18:37:21] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2016-01-14 18:37:21] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2016-01-14 18:37:21] Rank 0/1: Multiverso initialized successfully.
[INFO] [2016-01-14 18:37:21] Rank 0/1: Begin of configuration and initialization.
[INFO] [2016-01-14 18:37:21] Rank 0/1: End of configration and initialization.
[INFO] [2016-01-14 18:37:21] Rank 0/1: Begin of training.
[DEBUG] [2016-01-14 18:37:21] Request params. start = 0, end = 6719
[INFO] [2016-01-14 18:37:21] Rank = 0, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-01-14 18:37:21] Rank = 0, Alias Time used: 0.01 s
[ERROR] [2016-01-14 18:37:21] Rank=0 Trainer=0: TrainerBase::GetTable: Invalid table or row ids: 0 -1Segmentation fault

feiga commented 8 years ago

The program assumes that every word is represented by a non-negative word_id. Could you check whether your dataset satisfies this condition?
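One way to run that check is a short script that scans the libsvm file for negative word ids. The sketch below is only an illustration: the helper name `find_negative_word_ids` is hypothetical, and it assumes each line holds a document label followed by whitespace-separated `word_id:count` pairs.

```python
def find_negative_word_ids(lines):
    """Return (line_number, word_id) for every negative word id found."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.split()
        # Assumption: fields[0] is the document label, the rest are id:count pairs.
        for pair in fields[1:]:
            word_id = int(pair.split(":", 1)[0])
            if word_id < 0:
                bad.append((line_no, word_id))
    return bad


# Tiny in-memory sample; for a real file, pass an open file object instead.
sample = ["0\t0:2 5:1", "1\t-1:3 4:1"]
print(find_negative_word_ids(sample))
```

An empty result means no negative ids were found under the assumed line layout.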

feiga commented 8 years ago

Hi,

Word ids in the UCI dataset are greater than or equal to 1, so the script text2libsvm.py subtracts 1 from each word id to bring the ids into the range [0, V-1]. If your dataset already contains a word id equal to 0, subtracting 1 produces the invalid word id -1. You can modify the script or your dataset to fix the problem.
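As a rough illustration of the kind of change meant here (this is not the actual code of text2libsvm.py), the shift can be made conditional on the smallest word id in the data. The function name `shift_word_ids` and the (doc_id, word_id, count) triple layout are assumptions made for this sketch.

```python
def shift_word_ids(triples):
    """Subtract 1 from word ids only when the input looks 1-based."""
    min_id = min(word_id for _, word_id, _ in triples)
    if min_id < 0:
        raise ValueError("negative word id in input: %d" % min_id)
    # Heuristic (assumption): a word id of 0 means the data is already 0-based.
    offset = 0 if min_id == 0 else 1
    return [(doc, word_id - offset, count) for doc, word_id, count in triples]


one_based = [(1, 1, 2), (1, 3, 1)]
zero_based = [(1, 0, 2), (2, 4, 1)]
print(shift_word_ids(one_based))   # ids shifted down by 1
print(shift_word_ids(zero_based))  # ids left unchanged
```

Whatever offset is applied to the corpus would have to be applied to the dictionary as well, so that both files agree on the ids.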

On Monday, January 18, 2016 2:39 PM, Ankit Agarwal wrote:

Hi, I have checked this: the -1 is being introduced into the dictionary and the libsvm corpus by the text2libsvm.py code that I am using to convert from UCI to libsvm format.
