
Analyzing Corpora

For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is in Sherlock's repo.

The following sections describe the process of going from a collection of plain-text documents (e-mails in this case) to a visualization of the topics in these documents.

Tools

We are working in a mixture of Python, Scala, Spark, R, and other tools. Setup instructions for each of these tools are described below.

Spark setup

How do you get Spark running locally? Is this necessary, or is it optional because we are using Spark on the cluster?

Forqlift

Forqlift is a tool for converting plain text files to sequence files. HDFS (and thus Spark) does not work well with lots of small files, so sequence files are used instead.

To install forqlift, download the binaries and extract them. Add $FORQLIFT/bin to your PATH and you are ready to run forqlift.

Dataset

A good example data set is the Enron email archive. This data set can be downloaded from here.

Step 1 - The original data

The initial Enron email data set can be found here. This compressed file contains plain-text emails. Use forqlift to create a sequence file:

forqlift fromarchive enron_mail_20150507.tgz --file enron_mail.seq --compress bzip2 --data-type text

Inputs:

- enron_mail_20150507.tgz (compressed archive of plain-text e-mails)

Outputs:

- enron_mail.seq (sequence file containing the same e-mails)
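
As a sanity check, the resulting sequence file can be inspected from Spark directly. The following is a minimal Scala sketch; it assumes the file holds Text keys (the original file names) and Text values (the e-mail bodies), which is what forqlift should produce for plain-text input.

    import org.apache.hadoop.io.Text
    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: inspect the sequence file produced in step 1.
    // Assumes Text keys (file names) and Text values (e-mail bodies).
    object InspectSequenceFile {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("InspectSequenceFile"))

        val emails = sc
          .sequenceFile("data/enron_mail.seq", classOf[Text], classOf[Text])
          .map { case (name, body) => (name.toString, body.toString) }

        println(s"Number of e-mails: ${emails.count()}")
        emails.take(3).foreach { case (name, body) =>
          println(s"$name: ${body.take(80)}")
        }

        sc.stop()
      }
    }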

Step 2 - Preprocessing

Prepare the e-mails stored in the sequence file for LDA classification with EmailParser.scala.

spark-submit --class EmailParser $myjar data/enron_mail.seq --metadata data/metadata.seq --dictionary data/dic.csv --corpus data/bow.csv

The preprocessing excludes words that are too popular and words that are too rare; the criteria for this can be set with the optional arguments. A sketch of this pruning is shown after the outputs below.

Inputs:

- data/enron_mail.seq (sequence file from step 1)

Outputs:

- data/metadata.seq (e-mail metadata)
- data/dic.csv (dictionary mapping word ids to words)
- data/bow.csv (bag-of-words corpus)
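
To illustrate the pruning mentioned above, here is a minimal Scala sketch of dropping words by document frequency. The threshold values and the tokenized-RDD layout are assumptions for illustration; they are not EmailParser's actual defaults.

    import org.apache.spark.rdd.RDD

    // Minimal sketch of the kind of vocabulary pruning done in preprocessing:
    // drop words that occur in more than maxDocFraction of the documents
    // (too popular) or in fewer than minDocCount documents (too rare).
    object PruneVocabulary {
      def prune(tokenized: RDD[Seq[String]],
                minDocCount: Long = 5,
                maxDocFraction: Double = 0.5): Set[String] = {
        val numDocs = tokenized.count()

        // Document frequency: in how many e-mails does each word occur?
        val docFreq = tokenized
          .flatMap(tokens => tokens.distinct.map(word => (word, 1L)))
          .reduceByKey(_ + _)

        docFreq
          .filter { case (_, df) =>
            df >= minDocCount && df <= (maxDocFraction * numDocs).toLong
          }
          .keys
          .collect()
          .toSet
      }
    }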

Step 3 - Train LDA

This step can be run multiple times (for different numbers of topics).

See also the documentation on: https://github.com/nlesc-sherlock/spark-lda

spark-submit --class ScalaLDA $myjar --k 10 data/bow.csv data/lda.csv

Inputs:

- data/bow.csv (bag-of-words corpus from step 2)

Outputs:

- data/lda.csv
- data/lda.csv.model (the trained model, used in step 4)

For more information on the LDA optimization, see here.
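
For reference, the sketch below shows how a model with k topics can be trained with Spark MLlib's RDD-based LDA, which is presumably close to what ScalaLDA does internally. The assumed bow.csv layout (one docId,wordId,count triple per line) is a guess; check the spark-lda repo for the actual format.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.{DistributedLDAModel, LDA}
    import org.apache.spark.mllib.linalg.Vectors

    // Minimal sketch of LDA training with Spark MLlib's RDD-based API.
    // Assumes bow.csv contains lines "docId,wordId,count"; the format used
    // by ScalaLDA may differ.
    object TrainLdaSketch {
      def train(sc: SparkContext, bowPath: String, k: Int, vocabSize: Int): DistributedLDAModel = {
        // Build one sparse term-count vector per document.
        val corpus = sc.textFile(bowPath)
          .map(_.split(","))
          .map(f => (f(0).toLong, (f(1).toInt, f(2).toDouble)))
          .groupByKey()
          .mapValues(counts => Vectors.sparse(vocabSize, counts.toSeq))

        new LDA()
          .setK(k)
          .setMaxIterations(100)
          .run(corpus)
          .asInstanceOf[DistributedLDAModel]
      }
    }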

Step 4 - Apply LDA

Use the LDA model to generate the document-topic matrix.

spark-submit --class ApplyLDA $myjar data/lda.csv.model data/bow.csv data/document_topics.csv

Inputs:

- data/lda.csv.model (trained model from step 3)
- data/bow.csv (bag-of-words corpus from step 2)

Outputs:

- data/document_topics.csv (document-topic matrix: one row of topic weights per document)
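
The document-topic matrix contains one row of topic weights per document. A minimal Scala sketch of this computation with MLlib is shown below; loading the model from data/lda.csv.model and converting it to a local model are assumptions about how ApplyLDA works.

    import org.apache.spark.SparkContext
    import org.apache.spark.mllib.clustering.DistributedLDAModel
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.rdd.RDD

    // Minimal sketch: turn a trained model plus the bag-of-words corpus into
    // a document-topic matrix (one row of topic weights per document).
    object DocumentTopicsSketch {
      def documentTopics(sc: SparkContext,
                         modelPath: String,
                         corpus: RDD[(Long, Vector)]): RDD[(Long, Vector)] = {
        // Load the model saved in step 3 and convert it to a local model,
        // which can infer topic mixtures for documents.
        val model = DistributedLDAModel.load(sc, modelPath).toLocal
        model.topicDistributions(corpus)
      }
    }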

Step 5 - Visualization

Step 5.a - Run clustering / visualization (IPython notebook)

Step 5.b - Run R-shiny visualization

Further reading

This paper discusses issues with topic model stability; it would be interesting to read it and see what we can learn from it.