tapilab / protest

analyze brazilian protests on Twitter

Run final classifier on all unlabeled tweets #3

Closed aronwc closed 9 years ago

aronwc commented 9 years ago

Using best classifier from #2, classify remaining tweets and report a number of statistics, such as:

aronwc commented 9 years ago

Then, for each month, we can compute:

ElaineResende commented 9 years ago

All tweets = [30025, 36995, 46271, 62883, 65971, 90870, 110735, 84692, 96447, 94180, 114894, 158669]. Total: 992632

Positive tweets = [8493, 10135, 13116, 17242, 18652, 22375, 28034, 24310, 23276, 26574, 31461, 42163]. Total: 265831

Negative tweets = [21532, 26860, 33155, 45641, 47319, 68495, 82701, 60382, 73171, 67606, 83433, 116506]. Total: 726801

sentiment graphic

26.78% of tweets were classified as positive.

73.22% of tweets were classified as negative/neutral.
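The percentages follow directly from the monthly counts. A minimal sketch that reproduces them, with the numbers copied from this comment (the variable names are illustrative, not taken from the project code):

```python
# Monthly counts copied from the comment above (12 consecutive months).
all_tweets = [30025, 36995, 46271, 62883, 65971, 90870,
              110735, 84692, 96447, 94180, 114894, 158669]
positive = [8493, 10135, 13116, 17242, 18652, 22375,
            28034, 24310, 23276, 26574, 31461, 42163]

# Everything not classified as positive counts as negative/neutral.
negative = [a - p for a, p in zip(all_tweets, positive)]

total = sum(all_tweets)
pos_share = 100 * sum(positive) / total
neg_share = 100 * sum(negative) / total

# Per-month positive share, to see which months deviate from the overall rate.
monthly_pos_share = [100 * p / a for a, p in zip(all_tweets, positive)]

print(f"positive: {pos_share:.2f}%  negative/neutral: {neg_share:.2f}%")
```

The per-month shares are not reported in the thread; they are included only because a raw count can rise simply because total volume rises.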

aronwc commented 9 years ago

This looks good! Could you please do an additional version where you only count each user once? That is, if a user appears more than once, only consider their first tweet.

(In reply to ElaineResende's comment of Jun 4, 2015, 2:07 PM, with the monthly counts listed above.)
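Counting each user once, using only their first tweet, could look roughly like the sketch below. The record layout (dicts with `user`, `created_at`, and `label` keys) is an assumption made for illustration; the thread does not show how the tweets are actually stored.

```python
from datetime import datetime


def first_tweet_per_user(tweets):
    """Keep only the earliest tweet for each user.

    `tweets` is an iterable of dicts with hypothetical keys:
    'user', 'created_at' (datetime) and 'label' (1 = positive, 0 = other).
    """
    first = {}
    for tweet in tweets:
        user = tweet["user"]
        # Replace the stored tweet only if this one is strictly earlier.
        if user not in first or tweet["created_at"] < first[user]["created_at"]:
            first[user] = tweet
    return list(first.values())


# Toy data: the input does not need to be sorted by time.
sample = [
    {"user": "alice", "created_at": datetime(2013, 4, 2), "label": 1},
    {"user": "alice", "created_at": datetime(2013, 3, 1), "label": 0},
    {"user": "bob", "created_at": datetime(2013, 3, 5), "label": 1},
]
kept = first_tweet_per_user(sample)
```

With this filter a user's label is whatever their first tweet happened to be, so a user whose later tweets are positive can still count as negative.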

ElaineResende commented 9 years ago

Considering one tweet per user:

1tweet_peruser

The graph below is the correct one!

image

ElaineResende commented 9 years ago

Not considering retweets

nort s

aronwc commented 9 years ago

Great -- it looks like the "spike" switched from Apr'13 to Mar'13. Is this due to the filtering, or was there a problem in the original graph? Perhaps re-generate the original graph?

ElaineResende commented 9 years ago

I think it is due to the filtering, but I will double check.

aronwc commented 9 years ago

I see -- the March spike still exists in the first plot. Perhaps the April spike was due to the same users tweeting many times in April.

ElaineResende commented 9 years ago

Sorry, I only just saw your last comment. Yes, I double-checked, and I believe the same.

aronwc commented 9 years ago

Couple last things

Thanks!

aronwc commented 9 years ago

To deal with users who tweet more than once:

So, the y-axis is number of users who have tweeted at least one positive tweet this month.

ElaineResende commented 9 years ago

Here is what I did:

1) First, I collected the 574318 tweets from users who tweeted more than once; they represent about 57.9% of all tweets.
2) Then I created a dictionary with each of those users as a key and the class label and posted time as the values. The dictionary contains 113340 users (counting RTs).
3) Next, I kept only the keys whose value has the positive class. That gave me 35145 positively classified tweets.
4) Finally, I counted the number of tweets per month.

The graph of those tweets is shown below.

1

Not considering RTs:

2

ElaineResende commented 9 years ago

For March 2013, I computed the most frequent n-grams:

bigrams = [(('?', '?'), 30949), (('http', ':'), 23120), (('RT', '@'), 17071), (('electronic', 'cigarette'), 5975), (('...', 'http'), 5903), (('!', '!'), 3918), (('an', 'electronic'), 3189), (('?', 'http'), 3179), (('green', 'smoke'), 2848), (('electronic', 'cigarettes'), 2809)]

trigrams = [(('?', '?', '?'), 24862), (('...', 'http', ':'), 5898), (('?', 'http', ':'), 3169), (('an', 'electronic', 'cigarette'), 2957), (('smoking', 'an', 'electronic'), 2171), (('electronic', 'cigarette', '?'), 2039), ((':', 'http', ':'), 1952), (('!', '!', '!'), 1877), (('&', 'amp', ';'), 1747), (('caught', 'smoking', 'an'), 1727)]

4-grams = [(('?', '?', '?', '?'), 21847), (('smoking', 'an', 'electronic', 'cigarette'), 2124), (('an', 'electronic', 'cigarette', '?'), 1810), (('caught', 'smoking', 'an', 'electronic'), 1713), (('Onew', 'caught', 'smoking', 'an'), 1695), (('cigarette', '?', 'http', ':'), 1609), (('electronic', 'cigarette', '?', 'http'), 1606), ((':', 'Onew', 'caught', 'smoking'), 1440), (('allkpop', ':', 'Onew', 'caught'), 1379), (('@', 'allkpop', ':', 'Onew'), 1346)]

People talked a lot about Onew being caught vaping.

3 onew

I also generated a word cloud with an online tool.

wordle
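For reference, top-k n-gram counts like the ones in this comment can be computed with the standard library alone. This is a sketch, not the code used in the thread: it tokenizes toy tweets with `str.split`, whereas the lists above split punctuation into separate tokens, so a different tokenizer was evidently used there. The function name `top_ngrams` is made up for the example.

```python
from collections import Counter


def top_ngrams(token_lists, n, k=10):
    """Return the k most frequent n-grams over a list of tokenized tweets."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted views of the list yields every run of n
        # adjacent tokens; n-grams never span two different tweets.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)


# Toy input with whitespace tokenization (an assumption for illustration).
docs = [
    "Onew caught smoking an electronic cigarette",
    "caught smoking an electronic cigarette again",
]
token_lists = [d.split() for d in docs]

bigrams = top_ngrams(token_lists, 2)
fourgrams = top_ngrams(token_lists, 4)
```

Because retweets repeat the same text, a single widely shared tweet can dominate these counts, which matches what the Onew 4-grams above suggest.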

ElaineResende commented 9 years ago

September has the greatest number of retweets: 22196.

Bigrams = [(('?', '?'), 62409), (('http', ':'), 35810), (('RT', '@'), 23660), (('!', '!'), 11628), (('#', 'Ecigs'), 6395), (('.', '#'), 5641), (('quit', 'smoking'), 4938), (('?', 'http'), 4912), (('...', 'http'), 4784), (('amp', ';'), 4608)]

Trigrams = [(('?', '?', '?'), 47450), (('!', '!', '!'), 5684), (('?', 'http', ':'), 4908), (('...', 'http', ':'), 4733), (('&', 'amp', ';'), 4604), (('to', 'quit', 'smoking'), 3061), (('.', '#', 'Ecigs'), 2926), (('the', '#', 'FDA'), 2621), (('quit', 'smoking', '.'), 2445), (('.', 'http', ':'), 2428)]

4grams = [(('?', '?', '?', '?'), 40319), (('smoking', '.', '#', 'Ecigs'), 2341), (('the', '#', 'FDA', 'to'), 2137), (('to', 'quit', 'smoking', '.'), 2130), (('#', 'FDA', 'to', 'limit'), 2128), (('millions', 'to', 'quit', 'smoking'), 2119), (('a', 'product', 'that', 'is'), 2117), (('that', 'is', 'getting', 'millions'), 2115), (('is', 'getting', 'millions', 'to'), 2115), (('product', 'that', 'is', 'getting'), 2115)]

wordle9

aronwc commented 9 years ago

Great work, thanks!

Based on your n-grams, the September spike appears to be strongly influenced by a push by state Attorneys General to encourage the FDA to regulate e-cigarettes. See this story: http://articles.latimes.com/2013/sep/24/business/la-fi-mo-electronic-cigarette-attorney-general-20130924

It appears that people who are pro-ecigs launched a Twitter campaign to discourage this effort, by having people send messages to the Attorneys General saying, e.g.,

@INATTYGENERAL Why did you sign a letter asking the #FDA to limit a product that is getting millions to quit smoking. #Ecigs save lives!!!

I also found a discussion on Reddit where pro-ecig users are discussing how to best conduct their campaign: http://www.reddit.com/r/electronic_cigarette/comments/1n3k7r/fight_for_your_right_to_vape_daily_action_plan/

Are those n-grams only for those tweets classified as positive?

aronwc commented 9 years ago

If a user tweets 10 times in a month, does that contribute 10 or 1 to the y-axis in your graph?

Also, can you please regenerate the graph without limiting to users who have tweeted more than once? (i.e., include all users).

Finally, please create a separate graph showing the percentage of positive users per month. (i.e., of all the users who posted tweets about ecigs this month, what percentage posted at least one positive tweet?)
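The requested percentage (of all users who tweeted in a month, how many posted at least one positive tweet) can be sketched with per-month sets of users. The `(user, month, label)` tuple layout and the `'YYYY-MM'` month strings are assumptions for illustration only.

```python
from collections import defaultdict


def positive_user_share(tweets):
    """Return {month: % of that month's users with >= 1 positive tweet}.

    `tweets` is an iterable of (user, month, label) tuples, where month is
    a string such as '2013-03' and label is 1 for positive, 0 otherwise.
    """
    users = defaultdict(set)
    positive_users = defaultdict(set)
    for user, month, label in tweets:
        users[month].add(user)  # every user active this month
        if label == 1:
            positive_users[month].add(user)  # counted once, however often they tweet
    return {m: 100 * len(positive_users[m]) / len(users[m]) for m in sorted(users)}


sample = [
    ("alice", "2013-03", 1),
    ("alice", "2013-03", 1),  # second positive tweet, same user and month
    ("bob", "2013-03", 0),
    ("carol", "2013-04", 0),
    ("alice", "2013-04", 0),
]
shares = positive_user_share(sample)
```

Using sets means a user who tweets 10 times in a month contributes 1 to both the numerator and the denominator.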

ElaineResende commented 9 years ago

Sorry, the earlier n-grams were computed over all tweets. I have now recomputed them using only the positive tweets, which gives a total of 15799 tweets for March.

bigrams = [(('?', '?'), 4336), (('RT', '@'), 1769), (('http', ':'), 1638), (('I', "'m"), 1300), (('!', '!'), 1225), (('electronic', 'cigarette'), 1071), (('e', 'cig'), 988), (('.', 'I'), 777), (('.', '#'), 729), (('electric', 'cigarette'), 703)]

trigrams = [(('?', '?', '?'), 2871), (('!', '!', '!'), 578), (('&', 'amp', ';'), 483), (('...', 'http', ':'), 414), (('good', 'for', 'me'), 354), (('too', 'good', 'for'), 354), (('because', 'i', 'used'), 353), (('electronic', 'cigarette', 'which'), 353), (('BlackFriday', 'is', 'too'), 353), (('i', 'used', 'electronic'), 353)]

4grams = [(('?', '?', '?', '?'), 2184), (('BlackFriday', '#', 'BlackFriday', 'is'), 353), (('good', 'for', 'me', 'because'), 353), (('BlackFriday', 'is', 'too', 'good'), 353), (('too', 'good', 'for', 'me'), 353), (('because', 'i', 'used', 'electronic'), 353), (('me', 'because', 'i', 'used'), 353), (('electronic', 'cigarette', 'which', 'i'), 353), (('for', 'me', 'because', 'i'), 353), (('which', 'i', 'missed', 'a'), 353)]

For September, we have a total of 29222 positive tweets:

bigrams = [(('?', '?'), 9916), (('!', '!'), 6066), (('#', 'Ecigs'), 4373), (('quit', 'smoking'), 4223), (('RT', '@'), 3942), (('.', '#'), 3441), (('the', '#'), 3221), (('to', 'quit'), 2931), (('#', 'FDA'), 2803), (('I', "'m"), 2594)]

trigrams = [(('?', '?', '?'), 6491), (('!', '!', '!'), 2841), (('to', 'quit', 'smoking'), 2764), (('.', '#', 'Ecigs'), 2645), (('the', '#', 'FDA'), 2593), (('quit', 'smoking', '.'), 2308), (('#', 'FDA', 'to'), 2130), (('FDA', 'to', 'limit'), 2128), (('millions', 'to', 'quit'), 2125), (('a', 'product', 'that'), 2117)]

4grams = [(('?', '?', '?', '?'), 4579), (('#', 'FDA', 'to', 'limit'), 2128), (('the', '#', 'FDA', 'to'), 2128), (('millions', 'to', 'quit', 'smoking'), 2119), (('a', 'product', 'that', 'is'), 2116), (('that', 'is', 'getting', 'millions'), 2115), (('getting', 'millions', 'to', 'quit'), 2115), (('product', 'that', 'is', 'getting'), 2115), (('is', 'getting', 'millions', 'to'), 2115), (('a', 'letter', 'asking', 'the'), 2102)]

ElaineResende commented 9 years ago

Q: If a user tweets 10 times in a month, does that contribute 10 or 1 to the y-axis in your graph?

A: It currently contributes 10. I am going to fix it so that it contributes 1.

Edit:

Here is what I did to fix it. Before, I was creating a dictionary like this: d[u] = label, date, where u is the username, label is the class, and date is the posted time. Now I create it like this: d[u, label, date]. As I understand it, this gives me one key for each user per month. After that, I deleted the keys whose label was 0. The resulting dictionary has 199382 entries, which I sort by month to plot the graph.

![new](https://cloud.githubusercontent.com/assets/8547396/8164936/08d5efd6-1352-11e5-946b-203374382b11.png)

ElaineResende commented 9 years ago

image

aronwc commented 9 years ago

Beautiful! Can you please make the following small changes:

ElaineResende commented 9 years ago

image