dataCleaning.py
: all the methods to clean up the tweet set
glove_routines.py
: the SGD that factorizes the co-occurrence matrix
boosting/adaboost.py
: The main function using boosting to train and predict
boosting/trainset
: contains helpers for AdaBoost to manage the example weights, so that the error rate of any remaining weak learner can be computed easily
boosting/vocabulary.py
: contains the method that extracts the vocabulary from the file path given as input
boosting/weakLearner.py
: contains the definition and methods of the weak learners
text_classifier.py
: helper functions to read data and submit predictions
word2vec_routines.py
: helper functions for the word2vec method
run.py
: functions to generate a prediction from the final features, and the script that generates the best submission (with boosting or word2vec)
data/
: folder containing the source data (training and test sets), and where the predictions uploaded to Kaggle are stored
sandbox_2/max.ipynb
: notebook for tests
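glove_routines.py above factorizes the co-occurrence matrix with SGD; the idea can be sketched on a toy matrix (the dimensions, step size, and epoch count here are made up for illustration, not the project's values):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy co-occurrence matrix for a 6-word vocabulary; the real one is
# built from the tweet corpus by the scripts given with the project.
cooc = rng.integers(0, 10, size=(6, 6)).astype(float)

embedding_dim = 4          # the real embeddings are much larger
xs = rng.normal(size=(6, embedding_dim)) * 0.1
ys = rng.normal(size=(6, embedding_dim)) * 0.1
eta = 0.01                 # SGD step size (hypothetical value)

def loss():
    # Squared reconstruction error of the factorization xs @ ys.T
    return ((cooc - xs @ ys.T) ** 2).sum()

before = loss()
for epoch in range(50):
    for i in range(6):
        for j in range(6):
            # Gradient step on the squared error of entry (i, j)
            err = cooc[i, j] - xs[i] @ ys[j]
            xs[i], ys[j] = xs[i] + eta * err * ys[j], ys[j] + eta * err * xs[i]
after = loss()
```

After a few epochs the reconstruction error drops, and each row of `xs` can serve as a word embedding.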
The baseline method is not used in our code, but it is described in the paper and implemented in the file text_classifier.py.
- The paths to the files pos_train, neg_train and test_data.
- Uses vocab.pkl and the embeddings from the co-occurrence matrix Python file (not described here, as it is not used and was given with the project).
- The paths to the files pos_train, neg_train and test_data.
- The number of additional features in the file is set by hand (we fill them one by one), but knowing this number helps construct the vectors and matrices.
- Uses either weakLearner for prediction, or embeddings generated by word2vec libraries.
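The example-weight bookkeeping described for boosting/trainset and boosting/weakLearner follows the standard AdaBoost loop; a self-contained toy sketch (the 1-D dataset and threshold stumps here are hypothetical, not the project's weak learners):

```python
import numpy as np

# Toy 1-D dataset with labels in {-1, +1}
x = np.array([0.1, 0.4, 0.35, 0.8, 0.9, 0.7])
y = np.array([  1,   1,   -1,  -1,  -1,   1])

def stump(threshold):
    # Weak learner: predict +1 below the threshold, -1 above
    return lambda x: np.where(x < threshold, 1, -1)

learners = [stump(t) for t in (0.3, 0.5, 0.75)]

w = np.full(len(x), 1 / len(x))   # example weights, kept normalized
ensemble = []
for _ in range(3):
    # Weighted error rate of each remaining weak learner
    errors = [np.sum(w[h(x) != y]) for h in learners]
    best = int(np.argmin(errors))
    eps = errors[best]
    alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-10))
    ensemble.append((alpha, learners[best]))
    # Re-weight: boost the misclassified examples, then renormalize
    w *= np.exp(-alpha * y * learners[best](x))
    w /= w.sum()

def predict(x):
    # Weighted vote of the selected weak learners
    return np.sign(sum(a * h(x) for a, h in ensemble))
```

Keeping the weights normalized is what makes the weighted error of any remaining weak learner a single masked sum, as the trainset helpers do.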
When calling run.py, you can choose between the boosting method and word2vec by passing "boosting" or "word2vec" as the argument.
For word2vec, you can choose how much of the data to feed it; it builds a model keeping words that appear more than 2 times, with a feature size of 200 and a window of 8. If you want to change these parameters, just modify lines 70-73. A k-fold error will be printed, and the model runs in a reasonable time on the full set (at most 2 hours). For boosting, the weak learners are used for prediction.
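Once the word2vec model is trained, each tweet must become a fixed-size feature vector. How run.py combines the embeddings is not shown here, but a common approach (sketched below with toy random vectors standing in for the trained 200-dimensional embeddings) is to average the vectors of the words the model knows:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 200   # matches the feature size mentioned above

# Toy stand-in for a trained word2vec vocabulary; words appearing
# at most twice in the corpus would be absent from the real model.
word_vectors = {w: rng.normal(size=dim) for w in ["good", "day", "bad"]}

def tweet_features(tweet):
    """Average the embeddings of the known words of a tweet."""
    vecs = [word_vectors[w] for w in tweet.split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

features = tweet_features("what a good day")
```

Out-of-vocabulary words are simply skipped, and a tweet with no known words maps to the zero vector.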
```
python run.py boosting   # run boosting
python run.py word2vec   # run word2vec
```
Each runs with the dataset given as an argument in the file (which can be changed easily).
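The two commands above boil down to a dispatch on the first command-line argument; a minimal sketch of that pattern (the function bodies are placeholders, not run.py's internals):

```python
import sys

def run_boosting():
    print("training AdaBoost on the weak learners...")

def run_word2vec():
    print("training word2vec embeddings...")

# Map the accepted CLI argument to the corresponding method
METHODS = {"boosting": run_boosting, "word2vec": run_word2vec}

def main(argv):
    if len(argv) != 2 or argv[1] not in METHODS:
        sys.exit("usage: python run.py boosting|word2vec")
    METHODS[argv[1]]()

if __name__ == "__main__":
    main(sys.argv)
```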
## How to generate final predictions
Run the script ```run.py```; it writes the final prediction to the file ```data/predictions.csv```.
The best prediction is obtained by running ```./run.py word2vec```.
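The file written to ```data/predictions.csv``` follows the usual two-column Kaggle submission layout; a sketch of producing it (the column names and ±1 labels are assumptions about the expected format, not taken from run.py):

```python
import csv

def write_predictions(path, predictions):
    """Write (Id, Prediction) rows, ids numbered from 1."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for i, label in enumerate(predictions, start=1):
            writer.writerow([i, label])

# Toy labels; in practice these come from the chosen classifier
write_predictions("predictions.csv", [1, -1, 1])
```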