"MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text."
The output below was generated with the number of topics set to 15 and shows the top 5 keywords for each topic. The corpus was a collection of 450 Civil War obituaries and 50 running race reports (two very different categories of data).
#################OUTPUT#############
List of Topics
Obituaries Russell County Death Date
Alderson man Fields county Captain
years Mr Rev home church
Image Lebanon VA News 11
Jones grandchildren Lewis great Browning
mile race run Dwight Race
Jackson Duff Bundy Steele left
Kiser Ball Hurt Hendricks Norton
Honaker Fogleman ago Love Buckles
death good God friends friend
Litton Vicars Vermillion Gap Va
Bausell Bays Hill Camp Webb
train Porter head man struck
race miles time finish running
Obituaries Russell County Obituary Soldiers
##########################
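One way to produce output like this, assuming MALLET's command-line interface is used rather than the standalone tool, is an import step followed by a training step. The sketch below only builds the two argument lists and does not run MALLET; the corpus directory, the output file names, and the `bin/mallet` path are hypothetical placeholders.

```python
from pathlib import Path


def mallet_commands(corpus_dir, num_topics=15, num_top_words=5,
                    mallet_bin="bin/mallet", work_dir="mallet_out"):
    """Build (but do not run) the MALLET commands for a topic model."""
    work = Path(work_dir)
    corpus_file = work / "corpus.mallet"   # hypothetical file name
    keys_file = work / "topic_keys.txt"    # hypothetical file name

    # Step 1: convert a directory of plain-text files into MALLET's format.
    import_cmd = [
        mallet_bin, "import-dir",
        "--input", str(corpus_dir),
        "--output", str(corpus_file),
        "--keep-sequence",       # topic modeling needs word order preserved
        "--remove-stopwords",    # drop common English function words
    ]

    # Step 2: train the model and write the top keywords for each topic.
    train_cmd = [
        mallet_bin, "train-topics",
        "--input", str(corpus_file),
        "--num-topics", str(num_topics),
        "--num-top-words", str(num_top_words),
        "--output-topic-keys", str(keys_file),
    ]
    return import_cmd, train_cmd


if __name__ == "__main__":
    # "obituaries_and_race_reports" is a made-up directory name.
    for cmd in mallet_commands("obituaries_and_race_reports"):
        print(" ".join(cmd))
```

The printed commands can be pasted into a shell from the MALLET install directory, or passed to `subprocess.run` once the paths are adjusted.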
Topics 5 and 13 (counting from 0, so the 6th and 14th entries in the list) clearly come from the 50 running documents, and the other topics neatly highlight various aspects of the Civil War obituaries. We can make the list easier to read by replacing the topic numbers with user-chosen category labels:
List of Topics
Obituaries - Obituaries Russell County Death Date
Names, locations, and ranks - Alderson man Fields county Captain
Religious - years Mr Rev home church
Newspaper images - Image Lebanon VA News 11
Family - Jones grandchildren Lewis great Browning
Running and people - mile race run Dwight Race
Family names - Jackson Duff Bundy Steele left
Family names - Kiser Ball Hurt Hendricks Norton
Names and locations - Honaker Fogleman ago Love Buckles
Religious - death good God friends friend
Names and locations - Litton Vicars Vermillion Gap Va
Names and locations - Bausell Bays Hill Camp Webb
Death and names - train Porter head man struck
Running - race miles time finish running
Obituaries - Obituaries Russell County Obituary Soldiers
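The labeling step above can be sketched in a few lines. This is a minimal illustration, assuming the keyword lists are held in topic order with zero-based positions; the function names are made up for the example, and the labels are the hand-chosen ones from the list above.

```python
# Top-5 keywords per topic, in the order they appear in the output.
TOPIC_KEYWORDS = [
    "Obituaries Russell County Death Date",
    "Alderson man Fields county Captain",
    "years Mr Rev home church",
    "Image Lebanon VA News 11",
    "Jones grandchildren Lewis great Browning",
    "mile race run Dwight Race",
    "Jackson Duff Bundy Steele left",
    "Kiser Ball Hurt Hendricks Norton",
    "Honaker Fogleman ago Love Buckles",
    "death good God friends friend",
    "Litton Vicars Vermillion Gap Va",
    "Bausell Bays Hill Camp Webb",
    "train Porter head man struck",
    "race miles time finish running",
    "Obituaries Russell County Obituary Soldiers",
]

# Hand-chosen labels; position i labels topic i.
LABELS = [
    "Obituaries", "Names, locations, and ranks", "Religious",
    "Newspaper images", "Family", "Running and people",
    "Family names", "Family names", "Names and locations",
    "Religious", "Names and locations", "Names and locations",
    "Death and names", "Running", "Obituaries",
]


def label_topics(keywords, labels):
    """Pair each topic's keywords with its label as 'label - keywords'."""
    if len(keywords) != len(labels):
        raise ValueError("need exactly one label per topic")
    return [f"{label} - {words}" for label, words in zip(labels, keywords)]


def topics_mentioning(keywords, word):
    """Return zero-based indices of topics whose keywords include word."""
    target = word.lower()
    return [i for i, words in enumerate(keywords)
            if target in words.lower().split()]


if __name__ == "__main__":
    print("List of Topics")
    for line in label_topics(TOPIC_KEYWORDS, LABELS):
        print(line)
    print(topics_mentioning(TOPIC_KEYWORDS, "race"))  # [5, 13]
```

The keyword search at the end is only a sanity check: it picks out the same two running topics that were identified by eye.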
Mallet: http://mallet.cs.umass.edu/index.php
see also: http://programminghistorian.org/lessons/topic-modeling-and-mallet
use standalone Java version at: https://github.com/senderle/topic-modeling-tool
From the labeled topics we can group the data further into broader categories, and simplify:
Category 1 - Obituaries, names, locations, ranks, religious, newspaper images, family, family names, death.
Category 2 - Running and people.
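The grouping above amounts to mapping each topic label to one of the two categories. A minimal sketch, assuming the fifteen hand-chosen labels are listed in topic order (zero-based) and that any label not marked as running belongs to Category 1; the names below are illustrative only.

```python
from collections import defaultdict

# The fifteen topic labels, in topic order.
TOPIC_LABELS = [
    "Obituaries", "Names, locations, and ranks", "Religious",
    "Newspaper images", "Family", "Running and people",
    "Family names", "Family names", "Names and locations",
    "Religious", "Names and locations", "Names and locations",
    "Death and names", "Running", "Obituaries",
]

# Labels that belong to the running category; everything else is
# treated as part of the obituary category.
RUNNING_LABELS = {"Running and people", "Running"}


def group_topics(labels):
    """Map each category name to the zero-based topic indices it contains."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        category = "Category 2" if label in RUNNING_LABELS else "Category 1"
        groups[category].append(index)
    return dict(groups)


if __name__ == "__main__":
    for category, topics in sorted(group_topics(TOPIC_LABELS).items()):
        print(category, topics)
```

With these labels, Category 2 holds topics 5 and 13 and Category 1 holds the remaining thirteen.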
Thus we have gone from close to 500 documents to two categories in a few minutes.