larsmaaloee / deep-belief-nets-for-topic-modeling

This repository is a proof of concept toolbox for using Deep Belief Nets for Topic Modeling in Python.

Main exits with IOErrors #1

Open simNN7 opened 10 years ago

simNN7 commented 10 years ago

Hi Lars, thanks for the toolbox. I am having a hard time getting it to run, though. Is main.py supposed to work as is, or only with modifications? I downloaded the data set and changed all file extensions to .txt, but running it (on an iMac with OS X 10.9.5) returns:

```
Traceback (most recent call last):
  File "main.py", line 67, in <module>
    run_simulation('input/20news-bydate/20news-bydate-train', 'input/20news-bydate/20news-bydate-test', epochs=50, attributes=2000, evaluation_points=[1,3,7,15,31,63], binary_output=True)
  File "main.py", line 46, in run_simulation
    dat_proc_train = data_processing.DataProcessing(train_paths, words_count=attributes, trainingset_size=1.0, acceptance_lst_path="input/acceptance_lst_stemmed.txt")
  File "/Users/admin/Desktop/Deep-Belief-Nets-for-Topic-Modeling-master/DataPreparation/data_processing.py", line 42, in __init__
    self.acceptance_lst = open(acceptance_lst_path).read().replace(" ","").split("\n")
IOError: [Errno 2] No such file or directory: 'input/acceptance_lst_stemmed.txt'
```

Removing the acceptance_lst_path argument from the dat_proc_train = data_processing.DataProcessing(...) call results in:

```
Traceback (most recent call last):
  File "main.py", line 67, in <module>
    run_simulation('input/20news-bydate/20news-bydate-train', 'input/20news-bydate/20news-bydate-test', epochs=50, attributes=2000, evaluation_points=[1,3,7,15,31,63], binary_output=True)
  File "main.py", line 52, in run_simulation
    dat_proc_test = data_processing.DataProcessing(test_paths, trainingset_size=0.0, trainingset_attributes=data_processing.get_attributes())
  File "/Users/admin/Desktop/Deep-Belief-Nets-for-Topic-Modeling-master/DataPreparation/data_processing.py", line 437, in get_attributes
    return s.load( open( env_paths.get_attributes_path(training), "rb" ) )
IOError: [Errno 2] No such file or directory: 'output/train/BOWs/attributes.p'
```

larsmaaloee commented 9 years ago

Hello gents. Thank you very much for your interest in the toolbox. There are two cases in which these errors occur:

  1. You are missing an acceptance list (a whitelist of words). If you do not want to use one, simply omit the acceptance_lst_path argument.
  2. You have set trainingset_size to 0.0, which means the toolbox is trying to generate the testing set. This is not possible yet, since the training set has not been generated. Please set the training set size to an appropriate level, such as 0.7 (70%), if all documents are collected in one set. See the sketch below.
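
For example, a minimal corrected call in main.py might look like this (a sketch only; the argument names are taken from the tracebacks above, and train_paths is the list of document paths you already pass in):

```python
from DataPreparation import data_processing

# Sketch only: omit acceptance_lst_path entirely if you have no whitelist,
# and use a non-zero trainingset_size so the training set is generated
# before the testing set is requested.
dat_proc_train = data_processing.DataProcessing(
    train_paths,
    words_count=2000,      # number of BOW attributes
    trainingset_size=0.7,  # 70% of the documents go into the training set
)
```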

During the next days I will take a look at making the training easier to understand; I can see why this confuses you. But the way it is set up now allows many ways of generating the dataset, which is very handy when doing scientific analysis of different DBNs. Let me know if this helps you get started with the toolbox.

Best regards, Lars

karenkua commented 9 years ago

Hi Lars,

I am having the same issue as Vamsi-lg; could you help? Thanks!

larsmaaloee commented 9 years ago

Hello Karenkua and Vamsi-Ig. The attribute list (saved as the serialised file attributes.p) must be generated during data preparation by the __set_attributes method, as part of generating the training set. This is the list of words for the BOW. Please let me know how that works.
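
Once the training set has been generated, you can check that the attribute list exists and looks sensible. A small illustrative sketch (the path is the one from your traceback):

```python
import pickle

# Inspect the serialised BOW vocabulary produced by __set_attributes.
with open('output/train/BOWs/attributes.p', 'rb') as f:
    attributes = pickle.load(f)
print(len(attributes), attributes[:10])
```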

karenkua commented 9 years ago

Hi Lars, thanks for the prompt reply; it is greatly appreciated. I tried the following steps, and different problems arose.

  1. In main.py, I uncommented dat_proc_train.generate_bows() to generate the BOWs.
  2. In __read_docs_from_filesystem of data_processing.py there is an if statement (shown below) checking whether filenames end with .p. I commented it out, given that the data files downloaded from the 20 Newsgroups website are not in .p format:

```python
for doc in docs:
    if doc.endswith('.p'):
```

  3. However, in the section of data_processing.py that prints 'Reading and saving docs from file system', docs_list is False for all files.

I assume that has something to do with the .p files in step 2. Could you kindly advise whether the input files have to be in .p format (and if so, whether there is any code for converting them or a source for downloading them)? I got mine from http://qwone.com/~jason/20Newsgroups/ as mentioned in the README file. Or how could I get around this problem? Thanks again!

larsmaaloee commented 9 years ago

Hi again. I have now made various amendments to the toolbox, so it should be much clearer what needs to be done. Please read the README.md file and follow the three examples. That should get you up and running with the toolbox.

jyb002 commented 9 years ago

Hi Lars,

Thanks for your toolbox. I am trying to run your code, but it emits a runtime warning: "deep-belief-nets-for-topic-modeling/DBN/dbn.py:241: RuntimeWarning: overflow encountered in exp" on the line return 1. / (1 + exp(-x)).

I googled for solutions, such as replacing the sigmoid function at dbn.py:241 with return expit(x) or with return .5 * (1 + tanh(.5 * x)), but neither change works.
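
For clarity, here is a sketch of the two replacements I tried (both are mathematically the same logistic sigmoid, just written in a form that is friendlier to overflow):

```python
from numpy import tanh
from scipy.special import expit  # numerically stable logistic function

def sigmoid_expit(x):
    return expit(x)

def sigmoid_tanh(x):
    # 1/(1+exp(-x)) == 0.5*(1 + tanh(x/2))
    return 0.5 * (1.0 + tanh(0.5 * x))
```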

Do you have the same issue when you run the toolbox? And do you have any idea how to solve it? Thanks.

The details of the warnings are shown below:

```
Pre Training
Visible units: 2000
Hidden units: 500
/deep-belief-nets-for-topic-modeling/DBN/dbn.py:241: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + exp(-x))
/deep-belief-nets-for-topic-modeling/DBN/pretraining.py:140: RuntimeWarning: divide by zero encountered in log
  perplexity = nansum(vis * log(softmax_value))
/deep-belief-nets-for-topic-modeling/DBN/pretraining.py:140: RuntimeWarning: invalid value encountered in multiply
  perplexity = nansum(vis * log(softmax_value))
Bottom units: 500
Top units: 500
Epoch[ 1]: Error = 1.7385879
Bottom units: 500
Output units: 128
Epoch[ 1]: Error = 32.1944861
Time 71.8855669498
Fine Tuning
Backprop: Epoch 1
Large batch: 1 of 36
/deep-belief-nets-for-topic-modeling/DBN/dbn.py:241: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + exp(-x))
/deep-belief-nets-for-topic-modeling/DBN/dbn.py:241: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + exp(-x))
/deep-belief-nets-for-topic-modeling/DBN/dbn.py:241: RuntimeWarning: overflow encountered in exp
  return 1. / (1 + exp(-x))
```

larsmaaloee commented 9 years ago

Hi,

This happens because of numbers becoming too small, so that exp(-x) overflows. You may need to scale the data accordingly, but most overflow warnings do not have a real influence on the training.
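
If the warnings clutter your output, one option is to silence them locally where the sigmoid is evaluated. This is illustrative only, not part of the toolbox:

```python
import numpy as np

x = np.array([-1000.0, 0.0, 1000.0])

# Overflow in exp(-x) just saturates the sigmoid to 0, which is usually
# harmless, so the warning can be suppressed for this computation only.
with np.errstate(over='ignore'):
    y = 1.0 / (1.0 + np.exp(-x))
print(y)  # approximately [0., 0.5, 1.]
```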

Let me know if you have any more questions.

Best regards


Lars

AmrAzzam commented 9 years ago

Hi Dr. Lars, I would like to thank you for publishing your code. I am trying to run it and am facing some issues:

  1. There is a problem with unpickling: some of the dataset files do not work with it, so I removed those files from the data set.
  2. The parallel stemming does not work. It creates the files, but if I try to open them I just find an array of boolean False values (shown below). I am using Windows 7 64-bit.

[False, False, False, False, False, False, False, <type 'exceptions.StopIteration'>, False, False, <type 'exceptions.StopIteration'>, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, <type 'exceptions.StopIteration'>, <type 'exceptions.StopIteration'>, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, False, <type 'exceptions.StopIteration'>, False, False, False, False, False, False, <type 'exceptions.StopIteration'>, False]

alexminnaar commented 9 years ago

There are definitely some problems with importing the newsgroup training data:

  1. Your code looks for files that end in ".p"; however, the newsgroup files are ".txt" files.
  2. When you change the code to look for ".txt" files, there are still pickling errors with some of the files.
  3. When you get rid of the files with pickling errors, the docs_list list in __set_attributes() contains all False values.

Have you tested this? Didn't you run into the ".p" problem?

larsmaaloee commented 9 years ago

Hi Alex,

Thanks for your interest in the toolbox.

The code is a little outdated, but there are no problems in running it. The pickled files are temporary lists of words, used later for the BOW creation. You should not change the code to look for the .txt files. I believe what you are missing is the stemming: please stem the files first and then create the BOW, as in the example code.
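
For illustration, the stemming step looks roughly like this with nltk (a sketch only; the toolbox's own code may use a different stemmer and tokenizer):

```python
from nltk.stem.porter import PorterStemmer

# Stem each token of a document before the BOW is built.
stemmer = PorterStemmer()
tokens = "computing computers computed".split()
print([stemmer.stem(t) for t in tokens])  # ['comput', 'comput', 'comput']
```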

Let me know how it works. :)

Best regards


Lars

alexminnaar commented 9 years ago

Apologies. The problem was that I did not have nltk installed for the stemming. Strangely, the error did not say that nltk was missing; instead, the stemming seemed to be skipped altogether, which is what created the error about no ".p" files being created. It seems to be working now. Thanks!
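
For anyone who hits the same thing, a quick sanity check like this before running the pipeline would have surfaced the problem immediately (illustrative only):

```python
try:
    import nltk  # required for the stemming step
except ImportError:
    raise SystemExit("nltk is not installed; run 'pip install nltk' first")
```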