maxpr / PCML_Proj2

Machine Learning - Project 2: tweet prediction

Source files

Baseline

The baseline method is not used in our code, but it is described in the paper and can be found in the file text_classifier.py.

Input

- The paths to the files: pos_train, neg_train and test_data.
- Uses vocab.pkl and the embeddings from the co-occurrence matrix Python file (not described here, as it is not used and is provided with the project).

Output

Further implementation

Input

- The paths to the files: pos_train, neg_train and test_data.
- The number of additional features, which is set by hand in the file (we fill the features in one by one); having this number helps to construct the vectors and matrices.
- Uses either weakLearner for prediction or embeddings generated by word2vec libraries.

Output

Model Selection

When calling run.py, you can choose between the boosting method and word2vec by passing either "boosting" or "word2vec" as the argument.
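A minimal sketch of how such an argument could be dispatched, assuming run.py reads a single positional argument. The names `run_boosting`, `run_word2vec` and `select_model` are hypothetical placeholders and are not taken from the project.

```python
import argparse


def run_boosting():
    # Hypothetical placeholder for the boosting pipeline.
    return "boosting"


def run_word2vec():
    # Hypothetical placeholder for the word2vec pipeline.
    return "word2vec"


# Map each accepted argument value to the pipeline it selects.
MODELS = {"boosting": run_boosting, "word2vec": run_word2vec}


def select_model(argv=None):
    """Parse the command line and return the chosen pipeline function."""
    parser = argparse.ArgumentParser(description="Tweet prediction")
    parser.add_argument("model", choices=sorted(MODELS), help="method to run")
    args = parser.parse_args(argv)
    return MODELS[args.model]


# Passing an explicit list stands in for the real command line here.
chosen = select_model(["word2vec"])
print(chosen.__name__)
```

With `choices`, any value other than the two method names is rejected by argparse with a usage message, so a typo on the command line fails early.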

Model

For word2vec, you can choose the data you want to send to it. It constructs a model from the words appearing more than 2 times, with a feature size of 200 and a window of 8. If you want to change these parameters, just modify lines 70-73. A k-fold error will be printed. The model runs for a reasonable time on the full set (at most 2 hours). For boosting, weak learners are used to
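The word2vec settings mentioned above can be illustrated with a small plain-Python sketch. It is not the project's code: the names `W2V_PARAMS`, `prune_vocab` and `context_windows` are hypothetical, and the mapping to a specific library is an assumption (in gensim, for instance, the comparable options are `min_count` and `window`).

```python
from collections import Counter

# Hypothetical container for the three settings named in the text.
W2V_PARAMS = {
    "min_occurrences": 3,  # keep words appearing more than 2 times
    "feature_size": 200,   # dimension of each word vector
    "window": 8,           # context words considered on each side
}


def prune_vocab(tweets, min_occurrences):
    """Return the set of words seen at least `min_occurrences` times."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    return {w for w, c in counts.items() if c >= min_occurrences}


def context_windows(tokens, window):
    """Yield (center, context) pairs, with up to `window` words per side."""
    for i, center in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield center, left + right


# Toy data, made up for illustration only.
tweets = ["good day", "good good vibes", "bad day"]
vocab = prune_vocab(tweets, W2V_PARAMS["min_occurrences"])
pairs = list(context_windows("a b c".split(), window=1))
```

Raising the minimum count shrinks the vocabulary and speeds up training, while a larger window lets each word see more of the surrounding tweet.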

Example


python run.py boosting (runs boosting)
python run.py word2vec (runs word2vec)
Each uses the data set given as an argument in the file (this can be changed easily).

## How to generate final predictions
Run the script ```run.py```, which writes the final predictions to the file ```data/predictions.csv```.
The best predictions are obtained by running ./run.py word2vec.
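As a sketch of this last step, predictions could be written to a CSV file as shown here. The column names `Id` and `Prediction`, the -1/1 label encoding and the helper name `write_predictions` are assumptions, not details confirmed by this README.

```python
import csv
import os


def write_predictions(labels, path):
    """Write one row per test tweet: a 1-based id and its predicted label."""
    directory = os.path.dirname(path)
    if directory:
        # Create the output directory (e.g. data/) if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Id", "Prediction"])  # assumed header
        for index, label in enumerate(labels, start=1):
            writer.writerow([index, label])
```

Calling `write_predictions([1, -1, 1], "data/predictions.csv")` would write three rows after the header.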