Setup a classification problem for "active"

aronwc commented 9 years ago

Predict whether a user will tweet one of the anti-dilma keywords for the first time, based on their prior tweets.

aronwc commented 9 years ago

Let tweets

We want to create a labeled dataset to predict whether the next tweet a user posts will contain one of the anti-government keywords. We can do this as follows:

For each user U:

Identify the tweet T that is U's first use of one of the hashtags
Collect all tweets U posted prior to T. Call this T_u. This becomes a positive example.
To construct a negative example, remove from T_u the 10 tweets that occurred just prior to T.

To construct a feature vector from a list of a user's tweets, concatenate all tweets together and use TfidfVectorizer, as we did in the ecig experiments.

Here's an example:

User 1: 
   Tweet 1: a
   Tweet 2: a
   Tweet 3: b
   Tweet 4: b
   Tweet 5: c
   Tweet 6: c
   Tweet 7: d
   Tweet 8: d
   Tweet 9: e
   Tweet 10: e
   Tweet 11: f
   Tweet 12: #dilmaout
   Tweet 13: g

The positive example will consist of tweets 1-11, while the negative example will only contain tweet 1.
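The example above can be reproduced with a short sketch. The function name `build_examples` and its `keywords` and `window` parameters are hypothetical, chosen for illustration rather than taken from Classification.ipynb, and keywords are matched as simple lowercase substrings, which is an assumption.

```python
def build_examples(tweets, keywords, window=10):
    """Return (positive, negative) tweet lists for one user, or None.

    tweets   -- the user's tweets as strings, oldest first
    keywords -- the target hashtags/keywords, in lowercase
    window   -- how many of the most recent prior tweets to drop
                to form the negative example (assumed to be > 0)
    """
    # Find T: the index of the first tweet that contains any keyword.
    first = next((i for i, text in enumerate(tweets)
                  if any(k in text.lower() for k in keywords)), None)
    if first is None:
        return None  # the user never used a keyword
    positive = tweets[:first]      # T_u: everything posted before T
    if len(positive) <= window:
        return None  # too few prior tweets to form a negative example
    negative = positive[:-window]  # T_u without its last `window` tweets
    return positive, negative


tweets = ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e', 'f', '#dilmaout', 'g']
positive, negative = build_examples(tweets, keywords=['#dilmaout'])
print(len(positive), negative)  # 11 ['a']
```

Users with no keyword tweet, or with too few prior tweets, yield `None` and would simply be skipped.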

We may have to adjust how we select the negative example, depending on how many tweets each user has (e.g., if many users have fewer than 10 tweets, then we obviously can't construct a negative example for them). Instead, we could use a percentage (e.g., the negative example has 10% fewer tweets than the positive example).
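The percentage-based alternative might look like the sketch below. The function name `negative_by_percent` is hypothetical, and rounding up so that at least one tweet is always dropped is an assumption, not something decided in this thread.

```python
def negative_by_percent(prior_tweets, percent=10):
    """Drop the most recent `percent`% of a user's prior tweets.

    The count is rounded up with integer arithmetic, and at least one
    tweet is dropped so the negative example differs from the positive.
    """
    n_drop = max(1, (len(prior_tweets) * percent + 99) // 100)
    if n_drop >= len(prior_tweets):
        return None  # nothing would be left for the negative example
    return prior_tweets[:-n_drop]


print(negative_by_percent(list(range(11))))  # drops the last 2 of 11 tweets
```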

ElaineResende commented 9 years ago

I have problems with encoding. When I read or write files I pass encoding='utf-8'. After building the feature vector, the terms appear unencoded in the notebook, as in the example below. TfidfVectorizer takes utf-8 as its encoding parameter, but the output still looks unencoded. Do you have any tips for this situation?

E.g.: u'84 anos', u'aben\xe7oe', u'acesse', u'acesse http', u'acho',

aronwc commented 9 years ago

Point me to a sample file and the code you are using.

ElaineResende commented 9 years ago

The code is on GitHub in protest/Brazil project/Classification.ipynb. The sample file is DaviWesler.txt.txt, in the same directory as the notebook.

aronwc commented 9 years ago

It looks like you've got proper utf-8 in your vocabulary. I think there's just something wrong with the default way the notebook is printing. See screen shot below. I wouldn't worry about this.

[Screenshot, 2015-09-16 at 3:47:46 PM]
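A quick way to check that the escaped form is only a display matter is sketched below. It is written for Python 3, which is an assumption: the thread itself used Python 2, where the `repr` of a unicode string shows the `u'...'` prefix and `\xe7`-style escapes.

```python
term = 'aben\xe7oe'  # escaped form, as printed in the vocabulary

# \xe7 is the code point of 'ç', so both spellings are the same string.
assert term == 'abençoe'

print(ascii(term))           # escaped representation: 'aben\xe7oe'
print(term)                  # readable form
print(term.encode('utf-8'))  # the bytes written to a utf-8 file: b'aben\xc3\xa7oe'
```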

ElaineResende commented 9 years ago

That's weird. Thank you.

Another question:

Should I add a new key to each tweet in order to store the labels? If so, how do I label tweets that belong to both the negative and the positive example? The negative example contains the same tweets as the positive one minus the last 10, if I did it correctly, so how should I proceed?

aronwc commented 9 years ago

It may be easier to have a separate vector for labels, rather than storing them in the tweet json directly (for the reason you state). So, for each user, you will generate two feature vectors (x1, x2) as well as two labels (1, 0). You would accumulate these into a feature matrix X and label vector y.
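One way to accumulate the feature matrix and the label vector is sketched here. The `user_examples` dictionary and its toy contents are hypothetical. The sketch also assumes that a single TfidfVectorizer is fit on all documents at once, so that every row has the same columns; the thread does not say how the notebook handles this.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical input: user -> (positive tweets, negative tweets).
user_examples = {
    'user1': (['a b', 'c d', 'e f'], ['a b']),
    'user2': (['g h', 'a c'], ['g h']),
}

docs, labels, users = [], [], []
for user, (positive, negative) in sorted(user_examples.items()):
    # One document per example: all of its tweets concatenated.
    docs.append(' '.join(positive))
    labels.append(1)
    users.append(user)
    docs.append(' '.join(negative))
    labels.append(0)
    users.append(user)

# token_pattern is relaxed only so the one-letter toy tokens are kept.
vectorizer = TfidfVectorizer(token_pattern=r'\S+')
X = vectorizer.fit_transform(docs)  # sparse matrix, one row per example
y = np.array(labels)                # y[i] is the label of row i of X
users = np.array(users)             # users[i] is the owner of row i of X

print(X.shape, y)
```

Keeping `users` aligned with the rows of `X` makes it possible to group rows by user later on.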

ElaineResende commented 9 years ago

To be sure I am doing it right:

I have a dictionary in which each key is one user and the value is two tuples. The first tuple holds the two feature vectors (X1 = positive sample, X2 = negative sample) and the second tuple holds the labels (1 for X1 and 0 for X2). Is that right?

Example for 3 users:

defaultdict(<type 'tuple'>, {'antonionetoAC.txt.txt': ((<863x2252 sparse matrix of type '<type 'numpy.float64'>' with 12307 stored elements in Compressed Sparse Row format>, <853x2237 sparse matrix of type '<type 'numpy.float64'>' with 12212 stored elements in Compressed Sparse Row format>), (1, 0)),

'AntoniCorreaa.txt.txt': ((<33x63 sparse matrix of type '<type 'numpy.float64'>' with 202 stored elements in Compressed Sparse Row format>, <23x48 sparse matrix of type '<type 'numpy.float64'>' with 139 stored elements in Compressed Sparse Row format>), (1, 0)),

'ariadnimariano.txt.txt': ((<6559x11736 sparse matrix of type '<type 'numpy.float64'>' with 89121 stored elements in Compressed Sparse Row format>, <6549x11722 sparse matrix of type '<type 'numpy.float64'>' with 89000 stored elements in Compressed Sparse Row format>), (1, 0))})

aronwc commented 9 years ago

Yes, that should work. You'll then need to stack all the x vectors together into one X matrix, and also concatenate all the label values together into one y vector (making sure that the y's are aligned with the proper xi's). Then, you can do cross-validation.

We have to be a bit careful in cross-validation -- we want to prevent the same user from appearing in both the training and testing set in a fold.
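Keeping each user on one side of every fold can be sketched with scikit-learn's `GroupKFold`. This class comes from scikit-learn releases newer than this thread, so its use here is an assumption about tooling, and the toy arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical toy data: two rows (positive, negative) for each of four users.
X = np.arange(16).reshape(8, 2)
y = np.array([1, 0] * 4)
users = np.repeat(['u1', 'u2', 'u3', 'u4'], 2)

for train_idx, test_idx in GroupKFold(n_splits=2).split(X, y, groups=users):
    train_users = set(users[train_idx])
    test_users = set(users[test_idx])
    # No user contributes rows to both sides of the split.
    assert train_users.isdisjoint(test_users)
    print(sorted(train_users), sorted(test_users))
```

Passing the per-row user names as `groups` is what ties a user's positive and negative rows together.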


aronwc commented 9 years ago

Aron will attempt to modify Classification.ipynb to make creation of the X matrix more efficient (in both time and space).

aronwc commented 9 years ago

OK, I've updated Classification.ipynb so that feature vectors can be created in ~2 minutes. The accuracy is pretty bad, as expected. I tried out a few window sizes (i.e., the number of tweets to remove to get a negative example), and the window size has only a modest effect.

Results:

negative_window	train acc	test acc	n users
10	.591	.507	285
20	.610	.509	274
30	.614	.515	264

Please see #17 for possible ways of improving this accuracy.

aronwc commented 9 years ago

Based on Google Translate, some of the top terms make sense, e.g.:

fracos incompetentes
bundões

I don't get these, though. Perhaps you can shed light?

panelinhas (positive)
um comunista (negative)

ElaineResende commented 9 years ago

Wow, thank you so much. My coding seems so bad compared to yours.

Looking at the words:

Panelinhas - a group of people who stick together. In this case it probably refers to Dilma and her party.
Um comunista - Several people refer to Dilma, the ex-president Lula, and their party as communists. (I will read some tweets to be sure that this is the meaning.)

I hope I could help you to understand the meaning. Thank you very much for your help. I am going to try to improve our results now.

tapilab / protest

Setup a classification problem for "active" #15