Improve classifier - Githubissues

aronwc commented 8 years ago

Try various settings for classifier: C, ngrams (1,1), (1,2); min_df, max_df
Try various settings for negative_window
Look at highest/lowest coefficients to determine if they make sense, and to brainstorm other features we might use
Add features from neighbors (i.e. % of neighbors that have used hashtag)

aronwc commented 8 years ago

We definitely need to reduce vocabulary size using min_df (currently 1,509,078 terms)

aronwc commented 8 years ago

Depending on how this goes, we may also consider a slightly different prediction problem. E.g., a regression where the input is a tweet and the output is how many days prior to first protest tweet. While this is insanely hard to predict, it may capture the idea that the intensity of certain terms should increase over time.

ElaineResende commented 8 years ago

Even though the window=50 has the biggest testing accuracy, the best model in my view, is the one with a window=10 because the positive coefficients are the most correlated to the protests, for example:

panelaço - comes from "pot" and its meaning is related to people hitting pots to make noise after - - Dilma's speech on TV (https://www.youtube.com/watch?t=10&v=uCu62giHZSA)
HASHTAG_vaiadilma - means people booing dilma after the same speech.
HASHTAG_lulanacadeia - means put Lula (ex-president same party as Dilma) in jail
panelinhas - referring to panelaço

negative_window	ngram	min_df	max_df	train acc	test acc	n users
10	(1,2)	2	1.0	.591	.507	285
20	(1,2)	4	1.0	.611	.520	274
30	(1,2)	4	1.0	.617	.513	264
40	(1,1)	1	0.1	.837	.537	259
50	(2,2)	2	0.1	.751	.559	255

aronwc commented 8 years ago

I played around with a slightly different classification setup. (See Classification-Optimize-Window. )

I introduced two parameters:

window_sz: how many tweets should be used to construct each instance.
gap_sz: how many tweets should there be between the final tweet of the negative instance and the first tweet of the positive instance.

For example, if the first protest tweet occurs at tweet number 100, window_sz=5, and gap_sz=10, then we'll get:

positive instance: tweets[95:100]
negative instance: tweets[80:85]

The main difference is that previously the positive and negative instances contained many of the same tweets, whereas here they have no tweets in common.

I then iterated over many, many combinations of tokenizations, window_sz, and gap_sz. The results are here. The top 20 results are:

0.595926          collapse_mentions=True  collapse_digits=True  binary=False  ngram_range=(1, 2)  min_df=2  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.593526          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 1)  min_df=4  use_idf=False  norm=None  window_sz=20  gap_sz=100
0.590641          collapse_mentions=True  collapse_digits=False  binary=False  ngram_range=(1, 2)  min_df=2  use_idf=True  norm=None  window_sz=20  gap_sz=100
0.590471          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 1)  min_df=2  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.590404          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 2)  min_df=4  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.588552          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 1)  min_df=2  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.588519          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 1)  min_df=4  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.588462          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 1)  min_df=2  use_idf=False  norm=None  window_sz=20  gap_sz=100
0.588397          collapse_mentions=True  collapse_digits=True  binary=False  ngram_range=(1, 2)  min_df=2  use_idf=True  norm=l2  window_sz=20  gap_sz=100
0.588095          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 1)  min_df=4  use_idf=True  norm=None  window_sz=5  gap_sz=100
0.588095          collapse_mentions=True  collapse_digits=False  binary=False  ngram_range=(1, 1)  min_df=4  use_idf=False  norm=None  window_sz=5  gap_sz=100
0.586835          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 2)  min_df=4  use_idf=True  norm=l2  window_sz=1  gap_sz=20
0.586768          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 2)  min_df=2  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.586667          collapse_mentions=True  collapse_digits=True  binary=False  ngram_range=(1, 1)  min_df=4  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.585123          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 2)  min_df=4  use_idf=False  norm=None  window_sz=5  gap_sz=20
0.584983          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 2)  min_df=2  use_idf=False  norm=None  window_sz=1  gap_sz=20
0.584949          collapse_mentions=True  collapse_digits=True  binary=False  ngram_range=(1, 1)  min_df=2  use_idf=False  norm=l2  window_sz=1  gap_sz=20
0.584038          collapse_mentions=True  collapse_digits=False  binary=False  ngram_range=(1, 1)  min_df=4  use_idf=True  norm=None  window_sz=50  gap_sz=50
0.583462          collapse_mentions=True  collapse_digits=False  binary=True  ngram_range=(1, 1)  min_df=4  use_idf=True  norm=l2  window_sz=20  gap_sz=100
0.583397          collapse_mentions=True  collapse_digits=True  binary=True  ngram_range=(1, 1)  min_df=2  use_idf=False  norm=None  window_sz=20  gap_sz=100

I recommend we stick with the third place one:

0.590641          collapse_mentions=True  collapse_digits=False  binary=False  ngram_range=(1, 2)  min_df=2  use_idf=True  norm=None  window_sz=20  gap_sz=100

This has high accuracy, uses a largish window size (2), and also includes bigrams.

Here are the top terms. You'll have to tell me if they make any sense ;)

X.shape=(398, 12198)
HASHTAG_sabadodetremuranosdv
HASHTAG_globonews
ele
bem
volta
ontem
seu
THIS_IS_A_URL rt
cruzeiro
falando
photo
sim
dilma
que eu
joga
dia do
esse
chegando
eh
p
tudo
calma
o o
amiga
campos
parabéns
ahahahahaha
MENTION ahahahahaha
presidente
you
amor
alguns
perfil
MENTION te
MENTION HASHTAG_globonews
casa
cabeça
tem
pastor
HASHTAG_verdadessecretas

If I use that setting and vary the gap size, I get the following graph:

While there is some variance, accuracy generally increases with gap size, as expected.

Based on this, I recommend the following:

modifying Classification-testing to use the tokenization setting above.
modifying Classification-testing to consider sliding windows of size 20 when computing classification scores (rather then using all tweets up to time T).

ElaineResende commented 8 years ago

So, for the testing set is going to be all tweets a user have posted in jumping windows of size 20, it does not go just up to time T correct?

aronwc commented 8 years ago

Up to the first protest tweet.

On Fri, Oct 16, 2015 at 8:55 PM, ElaineResende notifications@github.com wrote:

So, for the testing set is going to be all tweets a user have posted in jumping windows of size 20, it does not go just up to time T correct?

— Reply to this email directly or view it on GitHub https://github.com/tapilab/protest/issues/17#issuecomment-148876386.

ElaineResende commented 8 years ago

I have changed Classification-testing nb and created a list of stopwords for Portuguese language and added it to default English list of stopwords. After this, we have a better list of coefficients. Such as:

HASHTAG_sabadodetremuranosdv
THIS_IS_A_URL rt
HASHTAG_sextadosdvcomsrtwitteiro
perfil
###### dilma
photo
volta
ontem
###### pt ---> dilma's party
sábado
chegando
cruzeiro
globo
brasil
sim
THIS_IS_A_URL acabou
THIS_IS_A_URL
falando
15
p
###### HASHTAG_debatenaglobo ---> debate on globo channel
tipo
###### corrupção ---> corruption
amiga
passa
bem
mudar
joga
preguiça
gol
parabéns
vamos
sábado 15
###### fora ---> out
###### presidente ---> president
HASHTAG_brasil
campos
gosta
HASHTAG_verdadessecretas
t
gols
photo THIS_IS_A_URL
status
metade
fred
###### deputado ---> deputy
vocês
casa
amor
###### panela ---> pot
HASHTAG_luciananaglobo
HASHTAG_mudabrasil
MENTION te
cortina
posted photo
uso
THIS_IS_A_URL THIS_IS_A_URL
###### vergonha ---> shame
HASHTAG_aécio45
publicar
de…
vez
te
marina
MENTION sim
calma
dom
###### manifestação ---> manifestation
amanhã
propaganda
fifa
cabeça
nós
###### band ---> news channel
hoje
###### petrobras ---> the biggest company which sells gas, etc http://www.petrobras.com/en/home.htm
###### HASHTAG_panelacohoje ---> hit on the pots today
vontade
pessoas
hahahaha
###### petistas ---> people from dilma's party
pe
feliz
lua
pão
###### sbt ---> news channel
contigo
HASHTAG_quintacomvalentinonosdv
desistir
##### protesto ----> protest
MENTION amiga
bater
#####  partido --> party
whats
brother
MENTION MENTION
aula
THIS_IS_A_URL photo
rede
ato

I tested several configurations checking their vocabulary but at the end I used the same configuration you advised. I could get accuracy of 60% with gap_sz=150, but I prefered the vocabulary using 100.

For testing set: I changed it to read a list of tweets up to first to use some keyword and then I read the list starting at the end of it, joining each 20, making groups and jumps like we previously discussed.

Here are the plots for testing instances: see Plots/

figures named as a-j are adding up the plots
figures named as aa-jj are from each step of cv

PS.: I am getting just 275 users in total at testing time, I am checking why.

ElaineResende commented 8 years ago

Our final model at this part is:

Model characteristics: collapse_mentions=True, collapse_digits=False, binary=False, ngram_range=(1,2), min_df=2, use_idf=True, norm='l2', window_sz=20, gap_sz=100

In the initial tests this model accuracy was 0.575769, after adding stopwords we get 0.588205.

Some users' first post comes with one of the keywords, so they are not considered, then we have at the end of testing time, 312 users.

Nb Classification-Optimize-Window has precision, recall, f-score and a confusion matrix for the model.

In Classification-testing I added the feature of neighbors hashtag use in the method iterate_instances_changed and I yield the result. How can I proceed in order to add this feature to our model?

aronwc commented 8 years ago

How can I proceed in order to add this feature to our model?

It looks like iterate_instances_changed returns the neighbor feature in the iterator. What you'll need to do is store that result in a list (like is done for users: x[0] for x in iterator if not users.append(x[2] and not neighbors.append(x[3]))

Then, you'll have to append a new column to the X matrix returned by the vectorizer. Something like this:

>>> from scipy.sparse import hstack
>>> X = hstack((X, np.array([neighbors]).T)).tocsr()

ElaineResende commented 8 years ago

I did it like you advised. But for fit_transform I get one column more, and when I try to use predict_proba for the testing set I get an error because of the difference of dimension for columns. Did I do anything wrong?

ValueError: X has 6850 features per sample; expecting 6851

aronwc commented 8 years ago

Make sure that:

you add the column prior to training, and
you add the column for each testing instance you create with .transform

On Tue, Oct 20, 2015 at 8:16 PM, ElaineResende notifications@github.com wrote:

I did it like you advised. But for fit_transform I get one column more, and when I try to use predict_proba for the testing set I get an error because of the difference of dimension for columns. Did I do anything wrong?

ValueError: X has 6850 features per sample; expecting 6851

— Reply to this email directly or view it on GitHub https://github.com/tapilab/protest/issues/17#issuecomment-149750839.

ElaineResende commented 8 years ago

When I consider just the neighbors feature for vectorizing I got an error (tokenizer =None, stopwords=None, lowercase=False):

TypeError: expected string or buffer

What I did to get it to work is to change the feature to string, but I believe it is not correct. With this approach the result of vocabulary is:

[('16666666666666666', 0.40124891749570202)]

What should I change to have the features as float?

aronwc commented 8 years ago

To use just the neighbor features, you do not need to run the vectorizer.transform method. Instead, just create a 2d array with the neighbor float values. E.g., [[.40], [.12], ...]

On Fri, Oct 23, 2015 at 7:06 AM, ElaineResende notifications@github.com wrote:

When I consider just the neighbors feature for vectorizing I got an error (tokenizer =None, stopwords=None, lowercase=False):

TypeError: expected string or buffer

What I did to get it to work is to change the feature to string, but I believe it is not correct. With this approach the result of vocabulary is:

[('16666666666666666', 0.40124891749570202)]

What should I change to have the features as float?

— Reply to this email directly or view it on GitHub https://github.com/tapilab/protest/issues/17#issuecomment-150555419.

ElaineResende commented 8 years ago

That is true! I am not thinking :( . I am so sorry for the dumb question.

aronwc commented 8 years ago

No worries! If this stuff were easy, I would be out of a job.

ElaineResende commented 8 years ago

I checked the model coefficient for different windows' sizes because the smaller the window is, less neighbors we have (different of 0) we have. What I understand from these coefficients is that the neighbors feature is important for our model and we can use it .

For clustering I am analysing different numbers and the centroids. I found interesting that the user Ary_AntiPT is in all different clusterings, this user's name means Anti PT (dilma's party) and he has 163 neighbors.

UPDATE:

Plots for each cluster are in Clustering Plots

File name example: clf3_0.png, where 3 is the quantity of clusters and 0 is the number of the cluster (0,1 or 2 in this case)

aronwc commented 8 years ago

Great! I think the Clustering Plots are numbered in reverse order? E.g., clf8_7 seems to have 9 users, which would be the first row in the 8 clusters table.

Could you please also create three additional plots showing just the centroids of each cluster? (e.g., for 3 clusters, you would have one plot just showing cassiosalva, Moreira and Ary).

ElaineResende commented 8 years ago

Sorry, the table was not in order. I updated it now. New plots you have asked are in Centroids. plots: 3 clusters, 6 clusters and 8 clusters.

And I added plots for each user(centroid) alone.

ElaineResende commented 8 years ago

For each model of cross-validation I used the metrics of precision, recall and f-score and I calculated the mean and standard deviation for all 10 models:

I am analysing the dendrogram below created with our data. Based on it we can create 4 or 5 clusters. I am still working on it and I hope find some interesting info from these groups.

ElaineResende commented 8 years ago

Precision, recall and f-score correction:

tapilab / protest

Improve classifier #17