Open ankitag2013 opened 8 years ago
The program assumes every word is represented by a word_id that is a non-negative integer. Can you please check whether your dataset satisfies this condition?
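One way to run that check is a short scan of the libsvm file, sketched below. The helper name `find_negative_word_ids` is hypothetical and not part of lightlda. It assumes each line starts with one label token followed by whitespace-separated `word_id:count` pairs; adjust the parsing if your file is laid out differently.

```python
def find_negative_word_ids(lines):
    """Return (line_number, word_id) pairs for every negative word id found."""
    bad = []
    for line_no, line in enumerate(lines, start=1):
        tokens = line.split()
        # Assumption: the first token is a document label,
        # and every remaining token has the form word_id:count.
        for token in tokens[1:]:
            word_id = int(token.split(":", 1)[0])
            if word_id < 0:
                bad.append((line_no, word_id))
    return bad
```

Passing an open file handle for the corpus (for example `test.libsvm`) as `lines` should list every offending line. An empty result means the non-negative condition holds for that file.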
Hi,
Word ids in the UCI dataset are greater than or equal to 1, so the script text2libsvm.py subtracts 1 from each word id to bring the ids into the range [0, V-1]. If your dataset already contains a word id equal to 0, subtracting 1 produces the invalid word id -1. You can modify the script or your dataset to solve the problem.
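A minimal sketch of one possible modification, assuming the corpus has already been parsed into (doc_id, word_id, count) triples. The function `to_zero_based` is hypothetical and is not the code in text2libsvm.py. It subtracts 1 only when no word id 0 is present, which is a heuristic: a 0-based dataset in which word 0 never occurs would still be shifted.

```python
def to_zero_based(triples):
    """Return (doc_id, word_id, count) triples with word ids starting at 0."""
    triples = list(triples)
    if not triples:
        return []
    min_id = min(word_id for _, word_id, _ in triples)
    if min_id < 0:
        raise ValueError("negative word id in input: %d" % min_id)
    # Heuristic: ids that start at 1 or higher are treated as 1-based (UCI style);
    # ids that already include 0 are left unchanged.
    offset = 1 if min_id >= 1 else 0
    return [(doc, word - offset, count) for doc, word, count in triples]
```

With this guard, 1-based input is shifted down by one, while input that already contains word id 0 passes through unchanged instead of producing -1.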
From: Ankit Agarwal [mailto:notifications@github.com] Sent: Monday, January 18, 2016 2:39 PM To: Microsoft/lightlda lightlda@noreply.github.com Cc: Fei Gao gf0109@gmail.com Subject: Re: [lightlda] Invalid table or row ids: 0 -1Segmentation fault (#20)
Hi, I have checked this. The -1 is being introduced into the dict and the libsvm corpus by the text2libsvm.py script, which I am using to convert from UCI to libsvm format.
I ran LightLDA over a corpus with 1000 documents and a vocabulary of approximately 22000 words. I used the text2libsvm script in lightlda/example to convert the UCI data to libsvm format and to generate the dictionary.
I am getting the following error:
bin/dump_binary test.libsvm test.dict . 0
There are totally 6729 words in the vocabulary
There are maximally totally 21405 tokens in the data set
The number of tokens in the output block is: 21405
Local vocab_size for the output block is: 6720
Elapsed seconds for dump blocks: 0.0125935
root@ankit:/home/ankit/lightlda# bin/lightlda -num_vocabs 22000 -num_topics 1000 -num_iterations 1 -alpha 0.1 -beta 0.01 -num_blocks 1 -max_num_document 1000 -input_dir /home/ankit/lightlda -data_capacity 800
[INFO] [2016-01-14 18:37:21] INFO: block = 0, the number of slice = 1
[INFO] [2016-01-14 18:37:21] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2016-01-14 18:37:21] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2016-01-14 18:37:21] Rank 0/1: Multiverso initialized successfully.
[INFO] [2016-01-14 18:37:21] Rank 0/1: Begin of configuration and initialization.
[INFO] [2016-01-14 18:37:21] Rank 0/1: End of configration and initialization.
[INFO] [2016-01-14 18:37:21] Rank 0/1: Begin of training.
[DEBUG] [2016-01-14 18:37:21] Request params. start = 0, end = 6719
[INFO] [2016-01-14 18:37:21] Rank = 0, Iter = 0, Block = 0, Slice = 0
[INFO] [2016-01-14 18:37:21] Rank = 0, Alias Time used: 0.01 s
[ERROR] [2016-01-14 18:37:21] Rank=0 Trainer=0: TrainerBase::GetTable: Invalid table or row ids: 0 -1
Segmentation fault