
Analyzing Corpora

For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is in Sherlock's repo.

The following sections describe the process of going from a collection of plain-text documents (e-mails in this case) to a visualization of the topics in these documents.

Tools

We are working in a mixture of Python, Scala, Spark, R, and other tools. Setup instructions for each of these tools are described below.

Spark setup

How do you get Spark running locally? Is this necessary, or is it optional because we are using Spark on the cluster?

Forqlift

Forqlift is a tool for converting plain text files to sequence files. HDFS (and thus Spark) does not work well with lots of small files, so sequence files are used instead.

To install forqlift, download the binaries and extract them. Add $FORQLIFT/bin to your PATH and you are ready to run forqlift.

Dataset

A good example data set is the Enron email archive. This data set can be downloaded from here.

Step 1 - The original data

The initial Enron email data set can be found here. This compressed file contains plain-text emails. Use forqlift to create a sequence file:

forqlift fromarchive enron_mail_20150507.tgz --file enron_mail.seq --compress bzip2 --data-type text

Inputs:

- enron_mail_20150507.tgz (compressed archive of plain-text e-mails)

Outputs:

- enron_mail.seq (sequence file containing the same e-mails)
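
As a sanity check, the resulting sequence file can be inspected from Spark directly. The following is a minimal Scala sketch; it assumes the file holds Text keys (the original file names) and Text values (the e-mail bodies), which is what forqlift should produce for plain-text input.

    import org.apache.hadoop.io.Text
    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: inspect the sequence file produced in step 1.
    // Assumes Text keys (file names) and Text values (e-mail bodies).
    object InspectSequenceFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("InspectSequenceFile"))

        val emails = sc
          .sequenceFile("data/enron_mail.seq", classOf[Text], classOf[Text])
          .map { case (name, body) => (name.toString, body.toString) }

        println(s"Number of e-mails: ${emails.count()}")
        emails.take(3).foreach { case (name, body) =>
          println(s"$name: ${body.take(80)}")
        }

        sc.stop()
      }
    }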

Step 2 - Preprocessing

Prepare the e-mails stored in the sequence file for LDA classification with EmailParser.scala.

spark-submit --class EmailParser $myjar data/enron_mail.seq --metadata data/metadata.seq --dictionary data/dic.csv --corpus data/bow.csv

The preprocessing excludes words that are too popular and words that are too rare; the criteria for this can be set with the optional arguments. A sketch of this pruning is shown after the outputs below.

Inputs:

- data/enron_mail.seq (sequence file from step 1)

Outputs:

- data/metadata.seq (e-mail metadata)
- data/dic.csv (dictionary mapping word ids to words)
- data/bow.csv (bag-of-words corpus)
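
To illustrate the pruning mentioned above, here is a minimal Scala sketch of dropping words by document frequency. The threshold values and the tokenized-RDD layout are assumptions for illustration; they are not EmailParser's actual defaults.

    import org.apache.spark.rdd.RDD

    // Minimal sketch of the kind of vocabulary pruning done in preprocessing:
    // drop words that occur in more than maxDocFraction of the documents
    // (too popular) or in fewer than minDocCount documents (too rare).
    object PruneVocabulary {
      def prune(tokenized: RDD[Seq[String]],
                minDocCount: Long = 5,
                maxDocFraction: Double = 0.5): Set[String] = {
        val numDocs = tokenized.count()

        // Document frequency: in how many e-mails does each word occur?
        val docFreq = tokenized
          .flatMap(tokens => tokens.distinct.map(word => (word, 1L)))
          .reduceByKey(_ + _)

        docFreq
          .filter { case (_, df) =>
            df >= minDocCount && df <= (maxDocFraction * numDocs).toLong
          }
          .keys
          .collect()
          .toSet
      }
    }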

Step 3 - Train LDA

This step can be run multiple times (for different numbers of topics).

See also the documentation on: https://github.com/nlesc-sherlock/spark-lda

spark-submit --class ScalaLDA $myjar --k 10 data/bow.csv data/lda.csv

Inputs:

- data/bow.csv (bag-of-words corpus from step 2)

Outputs:

- data/lda.csv
- data/lda.csv.model (the trained model, used in step 4)

For more information on the LDA optimization, see here.
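
For reference, the sketch below shows how a model with k topics can be trained with Spark MLlib's RDD-based LDA, which is presumably close to what ScalaLDA does internally. The assumed bow.csv layout (one docId,wordId,count triple per line) is a guess; check the spark-lda repo for the actual format.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    // Minimal sketch of LDA training with Spark MLlib's RDD-based API.
    // Assumes bow.csv contains lines "docId,wordId,count"; the format used
    // by ScalaLDA may differ.
    object TrainLdaSketch {
      def train(sc: SparkContext, bowPath: String, k: Int, vocabSize: Int): DistributedLDAModel = {
        // Build one sparse term-count vector per document.
        val corpus = sc.textFile(bowPath)
          .map(_.split(","))
          .map(f => (f(0).toLong, (f(1).toInt, f(2).toDouble)))
          .groupByKey()
          .mapValues(counts => Vectors.sparse(vocabSize, counts.toSeq))

        new LDA()
          .setK(k)
          .setMaxIterations(100)
          .run(corpus)
          .asInstanceOf[DistributedLDAModel]
      }
    }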

Step 4 - Apply LDA

Use the LDA model to generate the document-topic matrix.

spark-submit --class ApplyLDA $myjar data/lda.csv.model data/bow.csv data/document_topics.csv

Inputs:

- data/lda.csv.model (trained model from step 3)
- data/bow.csv (bag-of-words corpus from step 2)

Outputs:

- data/document_topics.csv (document-topic matrix: one row of topic weights per document)
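
The document-topic matrix contains one row of topic weights per document. A minimal Scala sketch of this computation with MLlib is shown below; loading the model from data/lda.csv.model and converting it to a local model are assumptions about how ApplyLDA works.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.DistributedLDAModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Minimal sketch: turn a trained model plus the bag-of-words corpus into
    // a document-topic matrix (one row of topic weights per document).
    object DocumentTopicsSketch {
      def documentTopics(sc: SparkContext,
                         modelPath: String,
                         corpus: RDD[(Long, Vector)]): RDD[(Long, Vector)] = {
        // Load the model saved in step 3 and convert it to a local model,
        // which can infer topic mixtures for documents.
        val model = DistributedLDAModel.load(sc, modelPath).toLocal
        model.topicDistributions(corpus)
      }
    }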

Step 5 - Visualization

Step 5.a - Run clustering / visualization (IPython notebook)

Step 5.b - Run R-shiny visualization

Further reading

This paper discusses issues with topic model stability; it would be interesting to read it and see what we can learn from it.