Python toolbox using deep belief nets (DBN) for running topic modeling on document data. The method takes a bag-of-words (BOW) representation of each document as input and produces a strong latent representation that can then be used for a content-based recommender system.
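To illustrate the bag-of-words input described above, here is a minimal sketch in plain Python. It is not the toolbox's own preprocessing, which may differ; the function names `build_vocabulary` and `bag_of_words`, the simple letters-only tokenizer and the toy documents are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(doc):
    """Lowercase the text and keep runs of letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", doc.lower())


def build_vocabulary(docs, max_words):
    """Return the max_words most frequent tokens across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    # Sort by descending count, then alphabetically so the order is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:max_words]]


def bag_of_words(doc, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus, only for illustration.
docs = [
    "The net learns topics.",
    "Topics describe the documents.",
    "The documents feed the net.",
]
vocab = build_vocabulary(docs, max_words=4)
bow_matrix = [bag_of_words(doc, vocab) for doc in docs]
```

Each row of `bow_matrix` is one document and each column one vocabulary word. A matrix of this kind, with a larger vocabulary, is the sort of input a DBN would compress into a low-dimensional latent representation.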
The toolbox was written for an M.Sc. thesis project. For a shorter read, we urge you to read the article Deep Belief Nets for Topic Modeling, accepted at the ICML 2014 workshop Knowledge-Powered Deep Learning for Text Mining (KPDLTM).
The toolbox has been tested on Windows 7, Ubuntu 14.04.1 and OSX 10.8-10. The following prerequisite packages must be installed on your system before running the toolbox: nltk, numpy, scipy, scikit-learn and matplotlib. If you are interested in producing 3D plots of the output space, you will also need to install MENCODER and FFMPEG (only tested on OSX).
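A quick way to check the prerequisites before running anything is sketched below. This is not part of the toolbox; it assumes a Python 3 interpreter, and the helper name `missing_packages` is made up for this example. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the prerequisite packages; scikit-learn is imported as "sklearn".
REQUIRED = ["nltk", "numpy", "scipy", "sklearn", "matplotlib"]


def missing_packages(names):
    """Return the names that cannot be located by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All prerequisite packages were found.")
```

The check only confirms that each package can be located, not that its version is compatible with the toolbox.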
In the main.py Python file you will find 3 examples of how to run the toolbox:
To run this example you will need to download the 20 Newsgroups dataset 20news-bydate.tar.gz from http://qwone.com/~jason/20Newsgroups/ and place the unpacked dir "20-news-bydate" in the "./input" dir.
The execution order is as follows:
To run this example you will need to download the 20 Newsgroups dataset 20news-18828.tar.gz from http://qwone.com/~jason/20Newsgroups/ and place the unpacked dir "20news-18828" in the "./input" dir.
The execution order is as follows:
In the "./output" dir there is a compressed file "_20news-19997.zip". It contains the output files from running the DBN (shape: 2000-500-250-125-10, with real-valued output units) on 20news-19997.tar.gz from http://qwone.com/~jason/20Newsgroups/ for 50 epochs of pretraining and finetuning. Unzip the compressed chunks by running the shell script "output/_unzip.sh".
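To make the shape notation concrete: in a DBN each pair of consecutive layers is connected by one weight matrix. The sketch below reads the shape string above and lists those matrix sizes. The helper names `parse_shape` and `weight_shapes` are hypothetical and not taken from the toolbox, and bias terms are left out of the count.

```python
def parse_shape(shape):
    """Split a shape string such as '2000-500-250-125-10' into layer sizes."""
    return [int(size) for size in shape.split("-")]


def weight_shapes(layer_sizes):
    """Pair consecutive layers; each pair gives one (inputs, outputs) weight matrix."""
    return list(zip(layer_sizes[:-1], layer_sizes[1:]))


sizes = parse_shape("2000-500-250-125-10")
shapes = weight_shapes(sizes)

# Number of connection weights, ignoring biases.
n_weights = sum(rows * cols for rows, cols in shapes)

for rows, cols in shapes:
    print("%d x %d" % (rows, cols))
print("total weights:", n_weights)
```

Read this way, the network takes a 2000-dimensional input and narrows it through three hidden layers to a 10-dimensional output per document.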
The execution order is as follows:
The toolbox applies to any text dataset as long as the execution order is followed (cf. Examples 1 and 2):
Please note that many of the learning parameters are hardcoded into the pretraining and finetuning modules. The current setting has proven to work on various datasets.
During execution all data is saved to the hard drive, which slows down execution but eliminates out-of-memory errors. It also lets the analyst resume training from an arbitrary point, even with different parameters.
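The save-and-resume idea can be sketched as follows. This is an illustration of the general pattern only, not the toolbox's actual file layout: the file naming scheme, the function names and the placeholder weight update are all assumptions.

```python
import os
import pickle
import tempfile


def save_checkpoint(directory, epoch, weights):
    """Write the state after one epoch to its own file (hypothetical naming scheme)."""
    path = os.path.join(directory, "epoch_%04d.pkl" % epoch)
    with open(path, "wb") as handle:
        pickle.dump({"epoch": epoch, "weights": weights}, handle)
    return path


def load_latest_checkpoint(directory):
    """Return the most recent saved state, or None if nothing has been saved."""
    names = sorted(
        name for name in os.listdir(directory)
        if name.startswith("epoch_") and name.endswith(".pkl")
    )
    if not names:
        return None
    with open(os.path.join(directory, names[-1]), "rb") as handle:
        return pickle.load(handle)


def train(directory, total_epochs):
    """Train up to total_epochs, continuing from the latest checkpoint if one exists."""
    state = load_latest_checkpoint(directory)
    start = 0 if state is None else state["epoch"] + 1
    weights = [0.0] if state is None else state["weights"]
    for epoch in range(start, total_epochs):
        weights = [w + 1.0 for w in weights]  # placeholder for a real weight update
        save_checkpoint(directory, epoch, weights)
    return weights


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        print(train(directory, 3))  # first run: epochs 0 to 2
        print(train(directory, 5))  # second run resumes at epoch 3
```

Because every epoch is written to its own file, an earlier state could also be loaded by name instead of the latest one, which is what makes restarting from a chosen point with changed parameters possible.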
(cf. the article or M.Sc. thesis mentioned at the beginning for proper citations of the literature used to realize this toolbox.)
Please do not hesitate to get in touch or contribute if you find any errors or have ideas. Enjoy.
Best regards
Lars Maaloee, PhD student, Technical University of Denmark, LinkedIn