shalomeir / tctm

Text Classification using Topic Model
Other
4 stars 3 forks source link

Text Classification using Topic Model (TCTM)

Text Clasification using Topic Model (TCTM) is a Java-based package for text classification tool using topic model. This application respond to Windows only until now.

TCTM is made based on MALLET toolkit. This use a SVM classifier from SVM multiclass provided by Thorsten Joachims.

This application's input is a text documents set divided by training/test, and directory should be exist per class. First output is a topic model. Second one is bag of words(BOW) features. Third is joint with topic and BOW. Lastly, we provide a result by accuracy.

Sample input directory scheme : (see the data\fourNewsGroups\ )

Quick Start Guide

Once you have downloaded and installed TCTM, the easiest way to get started is follow below step.

  1. Check directory architecture that we want to classify. -input \data\sampleforTutorial\train* (each directory should be a class(=label) name) \data\sampleforTutorial\test*

  2. Topic Modeling Make a model from a whole text corpus. edu.kaist.irlab.topics.tui.Text2VariedTopicModels --input data/sampleforTutorial/total/* --output-dir data/sampleforTutorial/topicmodel

  3. Feature Set Generation Make a varied feature set from train and test data and topic models. edu.kaist.irlab.topics.tui.Text2VariedSvmLightFeatures --input-train-dir data/sampleforTutorial/train/ --input-test-dir data/sampleforTutorial/test/ --input-topic-dir data/sampleforTutorial/topicmodel/VTopicModel_Wi100_Di200 --output-dir data/sampleforTutorial/FeatureSet_Wi100_Di200

  4. SVM Classification per feature set Excute SVM Multiclass Classification edu.kaist.irlab.classify.tui.ExecuteSvmMulticlass --input-dir data/sampleforTutorial/FeatureSet_Wi100_Di200/*

See more : https://docs.google.com/presentation/d/1emBVYeGFYF4F2Nbp9quiSrksXE-M-iNL8_-bCoiQzgo/edit#slide=id.p