This is the final project for a short ten-week course in Natural Language Processing (NLP). In this project, two classifiers are ensembled with a recurrent neural network: specifically, a long short-term memory (LSTM) generative agent was deployed, with the two classifiers assisting as rules-based components.

The first classifier predicts whether a given sentence is a question. The second predicts which StackOverflow channel best represents a given sentence. The ensembled application therefore invokes the LSTM agent only when a given sentence is predicted to be a question; if the agent cannot return an adequate response, the second classifier proposes a StackOverflow channel instead.
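The decision flow described above can be sketched as follows. This is a minimal, hypothetical illustration: the function names (`is_question`, `lstm_reply`, `suggest_channel`) and the keyword heuristic are stand-ins, not the project's actual API or models.

```python
# Hypothetical sketch of the ensemble decision flow; the real project
# uses trained classifiers and an LSTM agent rather than these stubs.

def is_question(sentence):
    """Toy stand-in for the first classifier (question detection)."""
    return sentence.rstrip().endswith('?') or sentence.lower().startswith(
        ('how', 'what', 'why', 'when', 'where', 'who'))

def lstm_reply(sentence):
    """Toy stand-in for the LSTM agent; may fail to produce a reply."""
    return None  # pretend the agent has no adequate response

def suggest_channel(sentence):
    """Toy stand-in for the second classifier (channel prediction)."""
    return 'stackoverflow.com/questions/tagged/python'

def respond(sentence):
    if not is_question(sentence):
        return None                   # not a question: do nothing
    reply = lstm_reply(sentence)
    if reply is not None:
        return reply                  # LSTM produced an adequate answer
    return suggest_channel(sentence)  # fall back to a channel suggestion

print(respond('How do I reverse a list in Python?'))
```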
Since this project is a proof of concept, the necessary build has been automated using a `Vagrantfile`. This local development environment requires both Vagrant and VirtualBox to be installed. Alternatively, a supplied `docker-compose.yml` can simulate a big-data scenario, where ingested data is distributed across multiple MongoDB nodes; for production systems, Kubernetes would likely replace the `docker-compose` variant. Additionally, supplied utility scripts can be used to install and configure CUDA and GPU-based TensorFlow.
Lastly, the supplied `config-TEMPLATE.py` must be copied to `config.py` in the same directory. Though values can be adjusted, `mongos_endpoint` must match the MongoDB endpoint. Specifically, if the local Vagrant instance was deployed, the following configuration would be appropriate:
```python
# general
mongos_endpoint = ['localhost:27017']
database = 'reddit'
collection = 'qa_reddit'
data_directory = 'data'
```
Three different datasets are used to generate the respective models.
The `run.py` entrypoint script supports the following features:

- `--insert`: inserts the relative `Reddit/data/` data into the specified MongoDB endpoint
- `--train`: trains an LSTM recurrent neural network on the inserted MongoDB data
- `--local`: implements the locally trained LSTM model
- `--drop`: drops all documents in the default MongoDB collection used during `--insert`
- `--generic`: implements a pretrained LSTM model based on 1M comment-reply pairs

The objective of this project was to perform classification analysis:
- `StackOverflow_Classification.ipynb`
- `StackOverflow_SKLearn.ipynb`
- `QuestionAnswerCMU_Classification.ipynb`
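As a rough illustration of the kind of channel classification explored in these notebooks, the following sketches a TF-IDF plus linear-model pipeline on toy data. The training sentences, labels, and model choice here are illustrative assumptions; the actual notebooks may use different features, models, and datasets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real StackOverflow dataset.
sentences = [
    'how do I merge two dataframes in pandas',
    'numpy array broadcasting error',
    'css flexbox centering a div',
    'html form submit button styling',
]
channels = ['python', 'python', 'css', 'css']

# Vectorize sentences with TF-IDF, then fit a linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(sentences, channels)

print(model.predict(['pandas dataframe groupby question'])[0])
```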
Additionally, two write-ups are provided, discussing both the analysis and the ensembled application of the two classifiers with the trained LSTM neural network: