dataCleaning.py
: all the methods to clean up the tweet set
glove_routines.py
: the SGD that factorizes the co-occurrence matrix
boosting/adaboost.py
: The main function using boosting to train and predict
boosting/trainset
: contains helpers for AdaBoost to manage the example weights, so that the error rate of any remaining weak learner can be computed easily
boosting/vocabulary.py
: contains the method that extracts the vocabulary from the file path given as input
boosting/weakLearner.py
: contains the definition and methods of the weak learners
text_classifier.py
: helper functions to read data and submit predictions
word2vec_routines.py
: helper functions for the word2vec method
run.py
: functions to generate a prediction from the final features, and the script that generates the best submission (with boosting or word2vec)
data/
: folder containing the source data (training and test sets), and where the predictions uploaded to Kaggle are stored
sandbox_2/max.ipynb
: notebook for tests
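glove_routines.py above factorizes the co-occurrence matrix with SGD; the idea can be sketched on a toy matrix (the dimensions, step size, and epoch count here are made up for illustration, not the project's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence matrix for a 6-word vocabulary; the real one is
# built from the tweet corpus by the scripts given with the project.
cooc = rng.integers(0, 10, size=(6, 6)).astype(float)

embedding_dim = 4          # the real embeddings are much larger
xs = rng.normal(size=(6, embedding_dim)) * 0.1
ys = rng.normal(size=(6, embedding_dim)) * 0.1
eta = 0.01                 # SGD step size (hypothetical value)

def loss():
    # Squared reconstruction error of the factorization xs @ ys.T
    return ((cooc - xs @ ys.T) ** 2).sum()

before = loss()
for epoch in range(50):
    for i in range(6):
        for j in range(6):
            # Gradient step on the squared error of entry (i, j)
            err = cooc[i, j] - xs[i] @ ys[j]
            xs[i], ys[j] = xs[i] + eta * err * ys[j], ys[j] + eta * err * xs[i]
after = loss()
```

After a few epochs the reconstruction error drops, and each row of `xs` can serve as a word embedding.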
The baseline method is not used in our code, but it is described in the paper and implemented in the file text_classifier.py.
- The paths to the files pos_train, neg_train and test_data.
- Uses vocab.pkl and the embeddings from the co-occurrence matrix Python file (not described here, as it is not used and was given with the project).
- The paths to the files pos_train, neg_train and test_data.
- The number of additional features in the file is set by hand (we fill them one by one), but knowing this number helps construct the vectors and matrices.
- Uses either weakLearner for prediction, or embeddings generated by word2vec libraries.
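The example-weight bookkeeping described for boosting/trainset and boosting/weakLearner follows the standard AdaBoost loop; a self-contained toy sketch (the 1-D dataset and threshold stumps here are hypothetical, not the project's weak learners):

```python
import numpy as np

# Toy 1-D dataset with labels in {-1, +1}
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([  1,   1,   -1,  -1,  -1,   1])

def stump(threshold):
    # Weak learner: predict +1 below the threshold, -1 above
    return lambda x: np.where(x < threshold, 1, -1)

learners = [stump(t) for t in (0.3, 0.5, 0.75)]

w = np.full(len(x), 1 / len(x))   # example weights, kept normalized
ensemble = []
for _ in range(3):
    # Weighted error rate of each remaining weak learner
    errors = [np.sum(w[h(x) != y]) for h in learners]
    best = int(np.argmin(errors))
    eps = errors[best]
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
    ensemble.append((alpha, learners[best]))
    # Re-weight: boost the misclassified examples, then renormalize
    w *= np.exp(-alpha * y * learners[best](x))
    w /= w.sum()

def predict(x):
    # Weighted vote of the selected weak learners
    return np.sign(sum(a * h(x) for a, h in ensemble))
```

Keeping the weights normalized is what makes the weighted error of any remaining weak learner a single masked sum, as the trainset helpers do.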
When calling run.py, you can choose between the boosting method and word2vec by passing "boosting" or "word2vec" as the argument.
For word2vec, you can choose how much of the data to feed it; it builds a model keeping words that appear more than 2 times, with a feature size of 200 and a window of 8. If you want to change these parameters, just modify lines 70-73. A k-fold error will be printed, and the model runs in a reasonable time on the full set (at most 2 hours). For boosting, the weak learners are used for prediction.
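Once the word2vec model is trained, each tweet must become a fixed-size feature vector. How run.py combines the embeddings is not shown here, but a common approach (sketched below with toy random vectors standing in for the trained 200-dimensional embeddings) is to average the vectors of the words the model knows:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200   # matches the feature size mentioned above

# Toy stand-in for a trained word2vec vocabulary; words appearing
# at most twice in the corpus would be absent from the real model.
word_vectors = {w: rng.normal(size=dim) for w in ["good", "day", "bad"]}

def tweet_features(tweet):
    """Average the embeddings of the known words of a tweet."""
    vecs = [word_vectors[w] for w in tweet.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = tweet_features("what a good day")
```

Out-of-vocabulary words are simply skipped, and a tweet with no known words maps to the zero vector.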
```
python run.py boosting   # run boosting
python run.py word2vec   # run word2vec
```
Each runs with the dataset given as an argument in the file (which can be changed easily).
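The two commands above boil down to a dispatch on the first command-line argument; a minimal sketch of that pattern (the function bodies are placeholders, not run.py's internals):

```python
import sys

def run_boosting():
    print("training AdaBoost on the weak learners...")

def run_word2vec():
    print("training word2vec embeddings...")

# Map the accepted CLI argument to the corresponding method
METHODS = {"boosting": run_boosting, "word2vec": run_word2vec}

def main(argv):
    if len(argv) != 2 or argv[1] not in METHODS:
        sys.exit("usage: python run.py boosting|word2vec")
    METHODS[argv[1]]()

if __name__ == "__main__":
    main(sys.argv)
```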
## How to generate final predictions
Run the script ```run.py```; it writes the final prediction to the file ```data/predictions.csv```.
The best prediction is obtained by running ```./run.py word2vec```.
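The file written to ```data/predictions.csv``` follows the usual two-column Kaggle submission layout; a sketch of producing it (the column names and ±1 labels are assumptions about the expected format, not taken from run.py):

```python
import csv

def write_predictions(path, predictions):
    """Write (Id, Prediction) rows, ids numbered from 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, label in enumerate(predictions, start=1):
            writer.writerow([i, label])

# Toy labels; in practice these come from the chosen classifier
write_predictions("predictions.csv", [1, -1, 1])
```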