For project Sherlock, our team aims to use NLP tools to analyse large collections of documents. The original description of the team's goals is in Sherlock's repo.
The following sections describe the process for going from a collection of plain text documents (emails in this case) to a visualization of the topics in these documents.
We are working in a mixture of Python, Scala, Spark, R, and other tools. Setup instructions for each of these tools are described here.
How do you get Spark running locally? Is it necessary, or is it optional because we are using Spark on the cluster?
Forqlift is a tool for converting plain text files to sequence files. HDFS (and thus Spark) does not work well with lots of small files, so sequence files are used instead.
To install forqlift, download the binaries and extract them. Add $FORQLIFT/bin to your PATH and you are ready to run forqlift.
A good example data set is the Enron email archive. This data set can be downloaded from here.
The initial Enron email data set can be found here. This compressed file contains plain text emails. Use forqlift to create a sequence file:
forqlift fromarchive enron_mail_20150507.tgz --file enron_mail.seq --compress bzip2 --data-type text
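To make the idea behind this step concrete, here is a minimal Python sketch of what packing an archive into key/value records means: every small file becomes one (name, text) pair in a single stream. It is not how forqlift is implemented and does not write the Hadoop sequence file format. The function names, the tiny in-memory archive, and the latin-1 encoding are assumptions made for illustration.

```python
import io
import tarfile


def iter_archive_records(archive, encoding="latin-1"):
    """Yield (member name, decoded text) for every regular file in a tar archive."""
    for member in archive:
        if not member.isfile():
            continue  # skip directories and links
        handle = archive.extractfile(member)
        yield member.name, handle.read().decode(encoding, errors="replace")


def build_demo_archive():
    """Build a tiny in-memory .tgz with two made-up emails (hypothetical data)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, body in [
            ("maildir/a/1.", "Subject: hello\n\nfirst mail"),
            ("maildir/b/2.", "Subject: bye\n\nsecond mail"),
        ]:
            data = body.encode("latin-1")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    buffer.seek(0)
    return buffer


# Many small files in, one list of key/value records out.
with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as archive:
    records = list(iter_archive_records(archive))

for key, value in records:
    print(key, len(value))
```

For the real data set, replacing the demo archive with tarfile.open("enron_mail_20150507.tgz") should give one record per email, which is a quick way to inspect the input before converting it.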
Inputs:
Outputs:
Prepare the emails stored in the sequence file for LDA classification with EmailParser.scala.
spark-submit --class EmailParser $myjar data/enron_mail.seq --metadata data/metadata.seq --dictionary data/dic.csv --corpus data/bow.csv
The preprocessing excludes words that are too common and words that are too rare; the criteria for both can be set with the optional arguments.
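The filtering described above can be illustrated with a small Python sketch based on document frequency. This is not EmailParser's actual logic: the function names, the thresholds min_docs and max_doc_fraction, and the example documents are all hypothetical, and the real criteria are whatever the optional arguments define.

```python
from collections import Counter


def build_dictionary(documents, min_docs=2, max_doc_fraction=0.5):
    """Keep words found in at least `min_docs` documents and in at most
    `max_doc_fraction` of all documents (both thresholds are assumptions)."""
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = max_doc_fraction * len(documents)
    return sorted(w for w, n in doc_freq.items() if min_docs <= n <= max_docs)


def bag_of_words(tokens, dictionary):
    """Turn one tokenized document into sorted (word index, count) pairs."""
    index = {word: i for i, word in enumerate(dictionary)}
    counts = Counter(t for t in tokens if t in index)
    return sorted((index[w], c) for w, c in counts.items())


# Made-up example documents, already tokenized on whitespace.
documents = [
    "the meeting is about the gas contract".split(),
    "the gas price went up".split(),
    "the lunch meeting moved".split(),
    "the contract for the pipeline".split(),
]

dictionary = build_dictionary(documents)
corpus = [bag_of_words(tokens, dictionary) for tokens in documents]
print(dictionary)
print(corpus[0])
```

In this toy example "the" is dropped for appearing in every document, and words that appear in only one document are dropped as too rare, which leaves a small dictionary and one bag-of-words row per document.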
Inputs:
optional arguments for EmailParser, see also:
spark-submit --class EmailParser $myjar --help
Outputs:
This step can be run multiple times, for example once for each number of topics you want to compare.
See also the documentation at https://github.com/nlesc-sherlock/spark-lda
spark-submit --class ScalaLDA $myjar --k 10 data/bow.csv data/lda.csv
Inputs:
Outputs:
For more information on the LDA optimization, see here.
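Since the LDA step may be repeated for several numbers of topics, a small helper can generate the commands. The Python sketch below only builds and prints the command lines, following the argument order of the ScalaLDA command above. The jar name myjar.jar, the list of topic counts, and the output naming scheme data/lda_k<k>.csv are assumptions, not project conventions.

```python
import shlex


def lda_commands(jar, corpus, topic_counts):
    """Build one spark-submit command (as an argument list) per number of topics."""
    commands = []
    for k in topic_counts:
        model = f"data/lda_k{k}.csv"  # hypothetical naming scheme, one output per k
        commands.append(
            ["spark-submit", "--class", "ScalaLDA", jar, "--k", str(k), corpus, model]
        )
    return commands


# "myjar.jar" stands in for the value of $myjar; the topic counts are examples.
commands = lda_commands("myjar.jar", "data/bow.csv", [5, 10, 20])
for command in commands:
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) to actually launch the job. Keeping one output file per value of k avoids overwriting earlier models.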
Use the LDA model to generate the document-topic matrix:
spark-submit --class ApplyLDA $myjar data/lda.csv.model data/bow.csv data/document_topics.csv
Inputs:
Outputs:
This paper discusses issues with topic model stability. It would be interesting to read it and see what we can learn from it.