HLTA is a novel method for hierarchical topic detection. Specifically, it models document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
The basic version of HLTA was proposed in: Hierarchical Latent Tree Analysis for Topic Detection. Tengfei Liu, Nevin L. Zhang and Peixian Chen. ECML/PKDD 2014: 256-272.
An accelerated version of HLTA using progressive EM was proposed in: Progressive EM for Latent Tree Models and Hierarchical Topic Detection. Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon and Zhourong Chen. AAAI 2016.
The full version of HLTA, with a comprehensive description as well as several extensions, can be found in: Latent Tree Models for Hierarchical Topic Detection. Peixian Chen, Nevin L. Zhang et al.
An IJCAI tutorial and demonstration can be found at: Multidimensional Text Clustering for Hierarchical Topic Detection (IJCAI 2016 Tutorial) by Nevin L. Zhang and Leonard K.M. Poon.
The original Java implementation of HLTA associated with these papers is available at: Old HLTA Page.
Download HLTA.jar and HLTA-deps.jar from the Release page.
An all-in-one command for hierarchical topic detection. It takes you through data conversion, model building, topic extraction and topic assignment:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HTD ./quickstart someName
If you are on Windows, use a semicolon as the classpath separator instead:
java -cp HLTA.jar;HLTA-deps.jar tm.hlta.HTD ./quickstart someName
The output files include:
- someName.sparse.txt: the converted data, generated if data conversion is necessary
- someName.bif: the HLTA model file
- someName.html: the HTML visualization
- someName.nodes.js: the topic tree
- someName.topics.js: a document catalog grouped by topics
- lib: JavaScript and CSS files required by the main HTML file
- fonts: fonts used by some CSS files

You can also run the command directly on a text file:
java -cp HLTA.jar;HLTA-deps.jar tm.hlta.HTD documents.txt someName
In documents.txt, each line is one document; a document may contain as many sentences as you like. For example:
The quick brown fox jump over the lazy dog. But the lazy dog is too big to be jumped over!
Lorem ipsum dolor sit amet, consectetur adipiscing elit
Maecenas in ligula at odio convallis consectetur eu ut erat
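If your corpus is already in memory, a minimal Python sketch writes it in this one-document-per-line format (the file name is just an example):

```python
# Write an in-memory corpus in the one-document-per-line format expected
# by tm.hlta.HTD; newlines inside a document are replaced so that each
# document stays on a single line.
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
]
with open("documents.txt", "w", encoding="utf-8") as f:
    for doc in docs:
        f.write(doc.replace("\n", " ") + "\n")
```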
Convert text files to a bag-of-words representation with a vocabulary of 1000 words and 1 concatenation:
java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert myData ./source 1000 1
After conversion, you can find:
- myData.sparse.txt: data in tuple format, i.e. lines of (docId, word) pairs (see the sketch after the path examples below)
- myData.dict.csv: the vocabulary list ('.dict-0.csv' is the list without concatenation, '.dict-1.csv' is after 1 concatenation, etc.)

You may put your files anywhere in ./source. Both txt and pdf files are accepted:
./source/IAmPDF.pdf
./source/OneDocument.txt
./source/Folder1/Folder2/Folder3/HiddenSecret.txt
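The converted myData.sparse.txt is plain text, so it is easy to sanity-check. Below is a minimal Python sketch that assumes one comma-separated (docId, word) pair per line; verify the delimiter against your own output before relying on it:

```python
import csv
from collections import Counter

# Count documents and word occurrences in the (docId, word) tuple file.
# Assumes comma-separated pairs, one per line; adjust the delimiter if
# your version of Convert writes a different format.
docs, words = set(), Counter()
with open("myData.sparse.txt", newline="", encoding="utf-8") as f:
    for doc_id, word in csv.reader(f):
        docs.add(doc_id)
        words[word] += 1

print(f"{len(docs)} documents, {len(words)} distinct words")
print("most frequent:", words.most_common(10))
```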
To hold out 20% of the documents as a test set (producing myData.test.sparse.txt, used later for per-document log-likelihood evaluation):
java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert --testset-ratio 0.2 myData ./source 1000 1
Build an HLTA model from the converted data:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel
The output files include:
- myModel.bif: the HLTA model file

Extract a topic tree from the topic model:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.ExtractTopicTree myTopicTree myModel.bif myDataset.sparse.txt
The output files include:
- myTopicTree.html: a website for browsing the topic tree
- myTopicTree.nodes.js: the topic tree stored as JavaScript
- myTopicTree.nodes.json: the topic tree stored as JSON
- lib: JavaScript and CSS files required by the main HTML file
- fonts: fonts used by some CSS files

You may use the "broadly defined topics" to speed up the process; under this definition, more documents will be categorized into a topic (see Section 8.2.1 of the full paper):
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.ExtractTopicTree --broad myTopicTree myModel.bif
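The schema of the nodes file is easiest to learn by inspection. The following Python sketch assumes nothing beyond the file being valid JSON:

```python
import json

# Load the extracted topic tree and print the beginning of it to see the
# node structure (labels, levels, children, etc.).
with open("myTopicTree.nodes.json", encoding="utf-8") as f:
    nodes = json.load(f)

print(type(nodes))
print(json.dumps(nodes, indent=2)[:1500])
```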
Find out which documents belong to each topic (i.e. inference):
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.Doc2VecAssignment myModel.bif myData.sparse.txt myAssignment
The output files include:
- myAssignment.topics.json: a document catalog grouped by topic
- myAssignment.topics.js: the document catalog stored as a JavaScript variable
- myAssignment.arff: doc2vec assignments in ARFF format

You may use the "broadly defined topics" to speed up the process; under this definition, more documents will be categorized into a topic (see Section 8.2.1 of the full paper):
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.Doc2VecAssignment --broad myModel.bif myData.sparse.txt topics
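The ARFF output can be loaded with standard tools, for example scipy.io.arff as sketched below; the expectation that attributes correspond to latent topic variables is an assumption to verify against your file:

```python
from scipy.io import arff

# Load the topic assignments: `meta` describes the attributes (expected
# to be one per latent topic variable) and `data` holds one row per
# document.
data, meta = arff.loadarff("myAssignment.arff")
print(meta.names())
print(data[:5])
```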
Evaluate by topic coherence
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.TopicCoherence myTopicTree.nodes.json myData.sparse.txt
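For reference, a widely used coherence score (Mimno et al., 2011) sums log((D(wi, wj) + 1) / D(wi)) over pairs of a topic's top words, where wi is ranked above wj and D counts the documents containing the given word(s). The Python sketch below implements that definition from documents given as word sets; it is illustrative and may differ in detail from what tm.hlta.TopicCoherence computes:

```python
import math
from itertools import combinations

def topic_coherence(top_words, docs):
    """Coherence of one topic, in the style of Mimno et al. (2011).

    top_words: the topic's top words, ordered from most to least frequent.
    docs: the corpus, one set of words per document.
    """
    doc_freq = {w: sum(w in d for d in docs) for w in top_words}
    score = 0.0
    for wi, wj in combinations(top_words, 2):  # wi is ranked above wj
        co_freq = sum(wi in d and wj in d for d in docs)
        score += math.log((co_freq + 1) / max(doc_freq[wi], 1))
    return score

# Toy example: a two-word topic scored against two tiny documents.
print(topic_coherence(["dog", "cat"], [{"dog", "cat"}, {"dog", "bird"}]))
```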
Evaluate by topic compactness. (v2.3+)
java -Xmx4G -cp HLTA.jar:HLTA-deps.jar tm.hlta.TopicCompactness myTopicTree.nodes.json GoogleNews-vectors-negative300.bin
Download the pre-trained word2vec model from https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit?usp=sharing. This is a pre-trained Word2Vec model released by Google; its description can be found at https://code.google.com/archive/p/word2vec/ under the section "Pre-trained word and phrase vectors".
To compute topic compactness in Python instead, install gensim (https://radimrehurek.com/gensim/) before using the Python code for computing the compactness scores in the AAAI17 paper (http://www.aaai.org/Conferences/AAAI/2017/PreliminaryPapers/12-Chen-Z-14201.pdf).
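As a minimal gensim sketch, the following treats a topic's compactness as the average pairwise word2vec similarity of its top words; this captures the general idea, but consult the AAAI17 paper for the exact definition behind the reported scores:

```python
from itertools import combinations
from gensim.models import KeyedVectors

# Load Google's pre-trained 300-dimensional word2vec vectors.
wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)

def topic_compactness(top_words):
    """Average pairwise cosine similarity of a topic's top words."""
    words = [w for w in top_words if w in wv]  # skip out-of-vocabulary words
    pairs = list(combinations(words, 2))
    return sum(wv.similarity(a, b) for a, b in pairs) / len(pairs)

print(topic_compactness(["dog", "cat", "pet", "animal"]))
```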
Evaluate by per-document log-likelihood, using the held-out test set created with --testset-ratio:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.PerDocumentLoglikelihood myModel.bif myData.test.sparse.txt
As introduced in Subroutine 2 of the Quick Example, we can train HLTA with default hyper-parameters:
java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel
HLTA also supports tuning the hyper-parameters (v2.3+):
java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA $trainingdata $EmMaxSteps $EmNumRestarts $EM-threshold $UDtest-threshold $outputmodel $MaxIsland $MaxTop $GlobalsizeBatch $GlobalMaxEpochs $GlobalEMmaxsteps $IslandNotBridging $SampleSizeForstructureLearn $MaxCoreNumber $parallelIslandFindingLevel $CT-threshold
For example (v2.3+),
java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA myData.sparse.txt 50 3 0.01 3 myModel 15 30 500 10 100 1 10000 2 1
Note that some of these hyper-parameters (e.g., $SampleSizeForstructureLearn and $MaxCoreNumber) can be adjusted to speed up training.
If you need to modify the source code and recompile HLTA, follow the steps below to set up sbt and build HLTA; otherwise, skip this section.
Run the following command to build the JAR files from source code:
sbt clean assembly assemblyPackageDependency && ./rename-deps.sh
The outputs are "HLTA.jar" and "HLTA-deps.jar", located in "target/scala-2.12/"; they can be run with the commands shown in the Quick Example.