Changes:
Python:
General syntax updates for the move from Python 2 to Python 3
Code comments have been significantly updated
cPickle --> pickle
Dependencies:
scipy==0.12.0 --> scipy==0.18.1
scikit-learn==0.15.1 --> scikit-learn==0.18.1
numpy==1.9.0 --> numpy==1.12.0
nltk==3.0.0 --> nltk==3.2.1
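To confirm that an environment matches the new pins, something like the following sketch could be used. It is not part of this pull request; the helper names `installed_version` and `find_mismatches` are hypothetical, and it assumes Python 3.8 or later for `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "scipy": "0.18.1",
    "scikit-learn": "0.18.1",
    "numpy": "1.12.0",
    "nltk": "3.2.1",
}


def installed_version(name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin
    to a (pinned, installed) tuple. `lookup` is injectable for testing."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: pinned {wanted}, installed {found}")
```

An empty result from `find_mismatches(PINNED)` means every pinned package is installed at exactly the listed version.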
Support Data:
There was a single non-ASCII character (the accented i in naïve) in the list of negative words. I replaced this character with the corresponding escape sequence. To parse the escape sequence properly, codecs.open is used instead of the built-in open for this file.
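A minimal sketch of reading such a file is shown here. The function name `load_word_list` is hypothetical, and the `unicode_escape` codec is an assumption: the description above does not say which encoding is passed to codecs.open.

```python
import codecs


def load_word_list(path):
    """Read one word per line, decoding backslash escapes such as \\xef."""
    # ASSUMPTION: the 'unicode_escape' codec; the actual encoding used
    # in the repository may differ.
    with codecs.open(path, "r", encoding="unicode_escape") as handle:
        lines = handle.read().splitlines()
    # Drop blank lines and surrounding whitespace.
    return [line.strip() for line in lines if line.strip()]
```

With this codec, a line stored on disk as the eight ASCII characters `na\xefve` comes back as the five-character word naïve.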
Additions:
Classification Model:
The updated pickle and scikit-learn libraries caused quite a hassle:
Classification models created with scikit-learn versions prior to 0.18.1 did not embed the scikit-learn version in the model. Since 0.18.1 does embed its version number, it gave a number of warnings about loading an outdated model.
I tried many approaches, but could not get Python 3's pickle to deserialize the model that had been serialized with Python 2's cPickle.
In response to these combined problems, I decided to retrain the classification model using the original dataset. Since this required generating dependency parses for each request in the dataset, I chose to dump the annotated dataset, with dependency parses, as a JSON dictionary in the format expected by /scripts/train_model.py. See the README for information about how to download these datasets (they are too large for version control).
The retrained model is included in this pull request. Since the JSON-formatted training datasets I have provided are so large, retraining the model will require a significant amount of memory.
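For reference, one commonly documented way to attempt loading a Python 2 pickle from Python 3 is the `encoding` argument of `pickle.load`. The sketch below is not the code I used, and the function name `try_load_legacy_pickle` is hypothetical. Even when the load succeeds at the pickle level, a scikit-learn model saved by one version is not guaranteed to work under another, which is why retraining was the safer route. Only unpickle files from a trusted source.

```python
import pickle


def try_load_legacy_pickle(path):
    """Attempt to load a pickle written by Python 2; return (obj, error)."""
    # 'latin1' maps every byte to a code point, so Python 2 str payloads
    # decode without raising; 'bytes' keeps them as raw bytes instead.
    for encoding in ("latin1", "bytes"):
        try:
            with open(path, "rb") as handle:
                return pickle.load(handle, encoding=encoding), None
        except Exception as exc:  # unpickling can raise many error types
            last_error = exc
    # Both attempts failed; report the most recent error to the caller.
    return None, last_error
```

A caller can check the second element of the returned tuple and fall back to retraining when it is not None.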
Original Datasets:
I have included the original datasets in the corpora/ directory.
Helper Functions:
I have created an additional file, corpora/download.py, which automates downloading and extracting the JSON-formatted training data.
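The general shape of such a helper, as a rough sketch using only the standard library, is given below. This is not the contents of corpora/download.py: the function name `download_and_extract` is hypothetical, and the real URL and archive format are documented in the README rather than here.

```python
import shutil
import urllib.request
from pathlib import Path


def download_and_extract(url, dest_dir):
    """Download an archive from `url`, unpack it into `dest_dir`,
    and return the sorted names of the entries now in `dest_dir`."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Name the local archive after the last path component of the URL.
    archive_path = dest / url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, str(archive_path))

    # unpack_archive infers the format (.zip, .tar.gz, ...) from the suffix.
    shutil.unpack_archive(str(archive_path), str(dest))
    archive_path.unlink()  # remove the archive once it is extracted
    return sorted(p.name for p in dest.iterdir())
```

Because `urlretrieve` also accepts `file://` URLs, the helper can be exercised against a local archive without network access.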