tapilab / protest

analyze brazilian protests on Twitter

Build a baseline classifier #2

Closed · aronwc closed this 9 years ago

aronwc commented 9 years ago

Use logistic regression to compute cross-validation accuracy on the binary classification task of "positive" versus "negative" or "other".

Consider various pre-processing strategies (tokenization, tf-idf, binary, n-grams).
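
For reference, a minimal sketch of such a baseline, assuming the labeled tweets are available as parallel lists `tweets` (raw text) and `labels` (1 = positive, 0 = other, -1 = negative); both names are hypothetical:

```python
# Baseline sketch: logistic regression over several pre-processing variants.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

vectorizers = {
    'counts': CountVectorizer(),
    'binary': CountVectorizer(binary=True),
    'tfidf': TfidfVectorizer(),
    'tfidf+bigrams': TfidfVectorizer(ngram_range=(1, 2)),
}

for name, vec in vectorizers.items():
    pipe = make_pipeline(vec, LogisticRegression())
    scores = cross_val_score(pipe, tweets, labels, cv=5, scoring='accuracy')
    print('%-15s mean accuracy = %.3f' % (name, scores.mean()))
```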

aronwc commented 9 years ago

baseline accuracy ≈ 0.68

ElaineResende commented 9 years ago

Customized Tokenizer done. Code available in the file: https://github.com/tapilab/ecig-classify/blob/master/FittingModel_ownTokenizer.ipynb
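
The notebook is linked above; as an illustration only, this is roughly how a custom tokenizer plugs into a scikit-learn vectorizer (the regex below, which keeps hashtags and @-mentions, is an assumption, not the tokenizer from the notebook):

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

def tokenize(text):
    # Lowercase; keep hashtags, @-mentions, and alphanumeric tokens.
    return re.findall(r'[#@]?\w+', text.lower())

vec = TfidfVectorizer(tokenizer=tokenize, token_pattern=None)
```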

ElaineResende commented 9 years ago

Corrected confusion matrix, where 1 = positive, 0 = other, -1 = negative: [confusion matrix image]
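
A sketch of producing that matrix with a fixed label order, reusing the hypothetical `pipe`, `tweets`, and `labels` from the baseline sketch above:

```python
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

# Held-out predictions for every tweet; rows/columns ordered 1, 0, -1.
y_pred = cross_val_predict(pipe, tweets, labels, cv=5)
print(confusion_matrix(labels, y_pred, labels=[1, 0, -1]))
```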

aronwc commented 9 years ago

Also consider different regularization for LogisticRegression: penalty='l2' or 'l1'

Also consider combining l2 and l1 (elastic net), using SGD.

For both, you'll need to search over the regularization strength parameter (C for LogisticRegression, alpha for SGD), e.g.

C = 0.1, 1, 5, 10, …, 100
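
A sketch of that search (the grid below is illustrative; the "…" above elides the exact values):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# liblinear supports both the l1 and l2 penalties.
for penalty in ('l1', 'l2'):
    for C in (0.1, 1, 5, 10, 50, 100):
        clf = LogisticRegression(penalty=penalty, C=C, solver='liblinear')
        pipe = make_pipeline(TfidfVectorizer(), clf)
        score = cross_val_score(pipe, tweets, labels, cv=5).mean()
        print('penalty=%s C=%5.1f accuracy=%.3f' % (penalty, C, score))
```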

aronwc commented 9 years ago

Try merging classes 0 and -1, and report precision/recall/F1 for class 1.
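
A sketch of the merge, again reusing the hypothetical names from the baseline sketch:

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import cross_val_predict

# Collapse classes 0 and -1 into a single class 0.
y_binary = np.where(np.asarray(labels) == 1, 1, 0)
y_pred = cross_val_predict(pipe, tweets, y_binary, cv=5)
p, r, f, _ = precision_recall_fscore_support(y_binary, y_pred,
                                             labels=[1], average=None)
print('class 1: precision=%.4f recall=%.4f f1=%.4f' % (p[0], r[0], f[0]))
```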

ElaineResende commented 9 years ago

I have merged classes 0 and -1. For class 1, precision = 0.9185..., recall = 0.6168..., and F1 = 0.7380...

What I understand is: 1) the percentage of retrieved tweets that are actually relevant (precision, 91%) is good; 2) the percentage of relevant tweets that are retrieved (recall, 62%) is not good, it should be higher; 3) I am not sure about the F1 measure; what is a good value for it?

Class 0 has better values for all measures. I am going to read the tweets again and try to see what I labelled "wrongly".


ElaineResende commented 9 years ago

Logistic Regression parameters searched: penalty = ['l1', 'l2'] and C (values below).

Using the last feature vector (81% accuracy) and changing only the penalty parameter, accuracy is higher with 'l2' (the default).

C = [0.01, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 5, 10, 30, 50, 70, 100]. Changing only the C parameter, the best value is C = 5.

Using C = 5 and penalty = 'l2', the accuracy is 0.8125 (an increase of 0.002).

PS: When I did grid search I got back best parameters C = 1 and penalty = 'l2' (0.8125), which is strange, because when I did cross-validation with those parameters I got 0.8105.
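
One possible explanation for the discrepancy: GridSearchCV and a hand-rolled cross-validation loop may use different fold splits. Fixing the splitter makes the two numbers directly comparable (a sketch, reusing the names from the earlier sketches):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(solver='liblinear'))
grid = GridSearchCV(pipe,
                    {'logisticregression__C': [0.01, 0.5, 1, 5, 10, 50, 100]},
                    cv=cv)
grid.fit(tweets, y_binary)
print(grid.best_params_, grid.best_score_)
```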

The number of true positives for the positive class improved over the previous model.

Now we have (per class, ordered [0, 1]):
Precision: [0.92935178, 0.92344498]
Recall: [0.96374622, 0.85650888]
F-score: [0.94623656, 0.88871834]

ElaineResende commented 9 years ago

I have used the SGD classifier. With default parameters, the accuracy is 0.7855.

I did a grid search over: penalty: ('none', 'l2', 'l1', 'elasticnet'), alpha: (0.0001, 0.0005, 0.0006, 0.0007, 0.0008, 0.0009, 0.001, 5, 10, 100).

As a result, the best parameters are alpha = 0.0005 and penalty = 'l2'.

Fitting the model with these parameters gives accuracy = 0.8130.

PS: Same result using 'elasticnet' as the penalty at fitting time.

With the SGD classifier, the number of true positives for class 0 increases, but for class 1 it decreases:

Precision: [0.88919477, 0.94149909]
Recall: [0.97583082, 0.76183432]
F-score: [0.93050054, 0.84219133]
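
A sketch of that SGD search (loss='log_loss' makes SGD fit a logistic model; in older scikit-learn versions the same loss is named 'log'):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

for penalty in ('l2', 'l1', 'elasticnet'):
    for alpha in (0.0001, 0.0005, 0.001):
        clf = SGDClassifier(loss='log_loss', penalty=penalty, alpha=alpha,
                            max_iter=1000, random_state=42)
        pipe = make_pipeline(TfidfVectorizer(), clf)
        score = cross_val_score(pipe, tweets, y_binary, cv=5).mean()
        print('penalty=%-10s alpha=%.4f accuracy=%.3f'
              % (penalty, alpha, score))
```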

aronwc commented 9 years ago

LogisticRegression, l2, C = 5: Pr/Re/F1 (class 1) = 0.92344498 / 0.85650888 / 0.88871834

Let's use this configuration to classify the remaining tweets.
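
A sketch of that final step; `unlabeled_tweets` is a hypothetical list holding the remaining, unlabeled tweet texts:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Refit the chosen configuration on all labeled tweets (merged labels),
# then label the rest of the corpus.
final = make_pipeline(TfidfVectorizer(),
                      LogisticRegression(penalty='l2', C=5,
                                         solver='liblinear'))
final.fit(tweets, y_binary)
predictions = final.predict(unlabeled_tweets)
```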

ElaineResende commented 9 years ago

I have done another analysis of what I had tried, and the new best value for C is 2.6.

The accuracy is 0.818.