tapilab / protest

analyze brazilian protests on Twitter

Add percent of tweets matching terms in Table 4.1 #9

Closed by aronwc 9 years ago

aronwc commented 9 years ago

For each term, print the percent of tweets from that month matching that term. E.g., for April, courtney may match 10% of all tweets.

Here's one numpy way of counting (assuming X is a binary matrix, one row per tweet, and y contains the months associated with each tweet).

X[np.where(y=='april')].sum(axis=0)
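The line above gives per-term counts for one month; dividing by the number of tweets in that month turns them into the percentages the issue asks for. A minimal sketch, assuming a tiny made-up X and y (the terms and values here are hypothetical, not taken from the project data):

```python
import numpy as np

# Hypothetical toy data: 4 tweets, 3 terms; 1 means the tweet contains the term.
terms = ["courtney", "protest", "brasil"]
X = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]])
y = np.array(["april", "april", "may", "april"])

mask = (y == "april")                    # boolean mask selecting the April tweets
counts = X[mask].sum(axis=0)             # number of April tweets matching each term
percents = 100.0 * counts / mask.sum()   # share of April tweets matching each term

for term, pct in zip(terms, percents):
    print(f"{term}: {pct:.1f}%")
```

A boolean mask is used here instead of np.where; both select the same rows.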

ElaineResende commented 9 years ago

I have been trying to do this, but I am having some trouble.

My idea was to add Y as a new column, but since X is huge (265831 x 32596), it is impossible to hold it in memory as a dense matrix, using todense() for example.

Do you have any suggestions?

Thanks in advance.

aronwc commented 9 years ago

No need to make X dense. X has shape [num tweets, num feats], which should be the output of the count vectorizer. Y has shape [num tweets]; each entry is the month in which that tweet was written.

There are multiple ways to do this. If there's another way that is easier for you, that's fine too.
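One possible way to keep X sparse, sketched under assumptions: the small csr_matrix below is a hypothetical stand-in for the count vectorizer output, and percent_by_month is an illustrative helper name, not something from the project. Y stays a separate array and is only used to pick rows, so nothing is ever appended to X or densified.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical stand-in for the vectorizer output: binary, sparse, [num tweets, num feats].
X = csr_matrix(np.array([[1, 0, 1],
                         [0, 1, 1],
                         [1, 1, 0],
                         [0, 0, 1]]))
y = np.array(["april", "april", "may", "april"])  # month of each tweet

def percent_by_month(X, y):
    """Map each month to a 1-D array: percent of that month's tweets matching each feature."""
    result = {}
    for month in np.unique(y):
        rows = np.flatnonzero(y == month)   # row indices of this month's tweets
        counts = X[rows].sum(axis=0)        # sparse row slice; only the 1 x num_feats sum is dense
        result[month] = 100.0 * np.asarray(counts).ravel() / len(rows)
    return result

percents = percent_by_month(X, y)
for month, values in percents.items():
    print(month, np.round(values, 1))
```

Only the per-month column sums become dense, and each of those has just one entry per feature, so memory use stays small even for a matrix of the size mentioned above.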

On Jun 30, 2015, at 7:06 PM, ElaineResende notifications@github.com wrote:

I have been trying to do this, but I am having some trouble.

X is a binary matrix, so what I have done is just use CountVectorizer and fit_transform to get the CSR matrix. To get Y, I build an array from the posted_time values. My idea was to add Y as a new column, but since X is huge (265831 x 32596), it is impossible to hold it in memory as a dense matrix, using todense() for example.

Do you have any suggestions?

Thanks in advance.


ElaineResende commented 9 years ago

Positive tweets:

image

Negative tweets:

image

Positive + Negative:

image

P.S. I am sorry, I thought I had already sent this.