kmpoon / hlta

Provides functions for hierarchical latent tree analysis on text data for hierarchical topic detection
GNU General Public License v3.0

Hierarchical Latent Tree Analysis (HLTA)

HLTA is a novel method for hierarchical topic detection. Specifically, it models document collections using a class of graphical models called hierarchical latent tree models (HLTMs). The variables at the bottom level of an HLTM are observed binary variables that represent the presence/absence of words in a document. The variables at other levels are binary latent variables, with those at the lowest latent level representing word co-occurrence patterns and those at higher levels representing co-occurrence of patterns at the level below. Each latent variable gives a soft partition of the documents, and document clusters in the partitions are interpreted as topics. Unlike LDA-based topic models, HLTMs do not refer to a document generation process and use word variables instead of token variables. They use a tree structure to model the relationships between topics and words, which is conducive to the discovery of meaningful topics and topic hierarchies.
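
As an illustration of the structure (the variable names below are invented for this sketch and do not come from the papers), a small HLTM with two latent levels over five words might look like this:

           Z1                <- higher-level latent variable
          /  \
        Y1    Y2             <- lowest-level latent variables (word co-occurrence patterns)
       / | \   | \
     w1 w2 w3 w4 w5          <- observed binary word presence/absence variables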

A basic version of HLTA is proposed here: Hierarchical Latent Tree Analysis for Topic Detection. Tengfei Liu, Nevin L. Zhang and Peixian Chen. ECML/PKDD 2014: 256-272

An accelerated version of HLTA using Progressive EM is proposed in: Progressive EM for Latent Tree Models and Hierarchical Topic Detection. Peixian Chen, Nevin L. Zhang, Leonard K. M. Poon and Zhourong Chen. AAAI 2016

A full version of HLTA with a comprehensive description as well as several extensions can be found in: Latent Tree Models for Hierarchical Topic Detection.
Peixian Chen, Nevin L. Zhang et al.

An IJCAI tutorial and demonstration can be found at: Multidimensional Text Clustering for Hierarchical Topic Detection (IJCAI 2016 Tutorial) by Nevin L. Zhang and Leonard K.M. Poon

The original HLTA Java implementation associated with the papers is available at: Old HLTA Page

Quick Example

Subroutine 1: Convert Text Files to Data
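
For instance, the conversion step might be invoked as below to turn the documents under ./source into a sparse data file; the class name tm.text.Convert and the argument order (output name, source directory, vocabulary size) are assumptions to be checked against the linked section:

   java -cp HLTA.jar:HLTA-deps.jar tm.text.Convert myData ./source 1000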

Subroutine 2: Model Building
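
A minimal model-building call with default hyper-parameters, taken verbatim from the Options section below:

   java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel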

Subroutine 3: Extract Topic Trees
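
A sketch of the extraction step; the class name tm.hlta.ExtractTopicTree and the argument order (output name, model file) are assumptions to be verified against the linked section:

   java -cp HLTA.jar:HLTA-deps.jar tm.hlta.ExtractTopicTree myTopicTree myModel.bif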

Subroutine 4: Doc2Vec Assignment
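
A sketch of the assignment step; the class name tm.hlta.Doc2VecAssignment and the argument order (model file, data file, output name) are assumptions to be verified against the linked section:

   java -cp HLTA.jar:HLTA-deps.jar tm.hlta.Doc2VecAssignment myModel.bif myData.sparse.txt myAssignment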

Options

As introduced in Subroutine 2 of the Quick Example, we can train HLTA with default hyper-parameters by:

   java -cp HLTA.jar:HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel
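
On Windows, the Java classpath separator is a semicolon rather than a colon, so the same command becomes:

   java -cp HLTA.jar;HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel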

HLTA also supports tuning the hyper-parameters (v2.3+):

   java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA $trainingdata $EmMaxSteps $EmNumRestarts $EM-threshold $UDtest-threshold $outputmodel $MaxIsland $MaxTop $GlobalsizeBatch $GlobalMaxEpochs $GlobalEMmaxsteps $IslandNotBridging $SampleSizeForstructureLearn $MaxCoreNumber $parallelIslandFindingLevel $CT-threshold

For example (v2.3+),

   java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA myData.sparse.txt 50 3 0.01 3 myModel 15 30 500 10 100 1 10000 2 1

The arguments are as follows (several of them, such as $SampleSizeForstructureLearn and $MaxCoreNumber, can be used to speed up training):

  1. $trainingdata: the file name of the training data
  2. $EmMaxSteps: maximum number of EM steps (default: 50)
  3. $EmNumRestarts: number of restarts in EM (default: 3)
  4. $EM-threshold: threshold controlling EM termination (default: 0.01)
  5. $UDtest-threshold: threshold controlling whether islands pass the UD-test (default: 3)
  6. $outputmodel: name of the output model
  7. $MaxIsland: maximum number of variables in one island (default: 15)
  8. $MaxTop: maximum number of variables at the top level (default: 30)
  9. $GlobalsizeBatch: batch size in global stepwise EM for parameter learning (default: 500)
  10. $GlobalMaxEpochs: maximum number of epochs in global stepwise EM for parameter learning (default: 10)
  11. $GlobalEMmaxsteps: number of steps in global stepwise EM for parameter learning (default: 100)
  12. $IslandNotBridging: whether to remove island bridging; 1 means island bridging is removed (default: 1)
  13. $SampleSizeForstructureLearn: number of samples used in structure learning (default: 10000)
  14. $MaxCoreNumber: number of parallel CPU processes (default: 2). Users can choose a suitable core number for the scale of their dataset; a further analysis of the speed/performance trade-off can be found in the paper. Note that this number should not exceed the number of CPU cores of your machine; otherwise it will slow HLTA down.
  15. $parallelIslandFindingLevel: when $MaxCoreNumber > 1, the highest level that uses parallel island finding. For example, $parallelIslandFindingLevel == 3 means levels 1, 2 and 3 use parallel island finding, while the remaining levels use serial island finding.
  16. $CT-threshold: threshold controlling whether an island passes the correlation test; leave it empty to skip the correlation test (default: empty, i.e. no correlation test)
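
To enable the correlation test, append a $CT-threshold value as the sixteenth argument; the threshold 0.05 below is purely illustrative:

   java -cp HLTA.jar:HLTA-deps.jar clustering.StepwiseEMHLTA myData.sparse.txt 50 3 0.01 3 myModel 15 30 500 10 100 1 10000 2 1 0.05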

Assemble

If you need to modify the source code and recompile HLTA, follow the steps below to set up the sbt project and compile HLTA. Otherwise, skip this section.

  1. Have Java 8 and sbt installed.
  2. Git clone this repository.
  3. Change directory to the project directory. (e.g. user/git/hlta)
  4. Run the following command to build the JAR files from source code:

    sbt clean assembly assemblyPackageDependency && ./rename-deps.sh

     The outputs are "HLTA.jar" and "HLTA-deps.jar" in "target/scala-2.12/"; they can be run following the instructions in "Quick Example".
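
Once built, the freshly compiled JARs can be used in place of the released ones, for example (paths relative to the project directory):

   java -cp target/scala-2.12/HLTA.jar:target/scala-2.12/HLTA-deps.jar tm.hlta.HLTA myData.sparse.txt 50 myModel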

Enquiry

Contributors