tapilab / protest

analyze brazilian protests on Twitter

Create plot showing probability(protest) over time #18

Open aronwc opened 8 years ago

ElaineResende commented 8 years ago

The testing set was created with at most 1500 tweets per user, because if I fetch more tweets I get a memory error. The plot of probabilities is shown below. Is this what you had in mind?

[image: plot of probabilities]

aronwc commented 8 years ago

yes, neat! I wonder why the probabilities plateau so quickly (I'd expect to still see more for x > 400). Are we using binary or tf-idf term vectors?

We also need a way to adjust for the different number of tweets per user. Perhaps the x axis becomes % of total tweets?
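One way to put users with different tweet counts on a common x axis is to rescale each user's tweet index to a percentage of that user's total. A minimal sketch (the function name and inputs are illustrative, not the notebook's actual code):

```python
import numpy as np

def to_percent_axis(probs):
    """Map tweet positions to % of the user's total tweets, so users
    with different tweet counts share one x axis (0-100%)."""
    n = len(probs)
    return 100.0 * np.arange(1, n + 1) / n  # last tweet lands at 100%

# e.g. a user with 4 scored tweets
x = to_percent_axis([0.1, 0.4, 0.6, 0.9])
```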

ElaineResende commented 8 years ago

We are using tf-idf. Sure, I will change the x axis.

ElaineResende commented 8 years ago

Well, now I have this plot; does it look like what you were expecting?

[image: plot with x axis as % of total tweets]

aronwc commented 8 years ago

Not sure what I was expecting, but this looks interesting.

We're looking for two things:

ElaineResende commented 8 years ago

These are the points I took from our discussion last week. I have probably done something wrong, because there are no spikes or groups; or maybe the classifier did not help. The probabilities I printed are very similar. But I am going to take a further look at it; maybe I missed something.

Our testing set is: For each user U:

aronwc commented 8 years ago

I'm looking at the Classification-testing notebook now. It looks right. I would probably change this part:

            if i > negative_window and matches_keywords(js['text'], keywords):
                print(user)
                var=''
                for l in lines[-100:][::-1]:  # just look at the most recent 100 tweets. 

I agree that running for all 3200 tweets is too much. It looks like you were using the first 1500 tweets, instead of the last 1500 tweets.

The code above just uses the most recent 100 tweets, which should be enough.

Another thing: It looks like you train on a random sample of users then test on the remaining. I'd recommend doing this multiple times to get predictions for all users. One way to do this is to read in all the users at training time. Then, do a cross-validation loop, training on X% of the users, then reading in the testing users and predicting on them.
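The suggested protocol could be sketched as follows: split the *users* into K folds, train on the other folds, and predict on the held-out users, so that every user receives a prediction exactly once. This is a sketch with illustrative names (`train_fn`, `predict_fn` stand in for the notebook's actual training/testing routines); it uses the modern `sklearn.model_selection.KFold` API (older releases had `sklearn.cross_validation.KFold` with an `n_folds` argument):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_val_user_probs(users, train_fn, predict_fn, n_folds=10, seed=42):
    """Train on (K-1)/K of the users, predict on the held-out 1/K,
    and repeat until every user has been in the test set once."""
    users = np.asarray(users)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    probs = {}
    for train_idx, test_idx in kf.split(users):
        model = train_fn(users[train_idx])   # fit on the training users
        for u in users[test_idx]:            # score each held-out user
            probs[u] = predict_fn(model, u)
    return probs  # one prediction per user, each produced out-of-fold
```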

I do think there are noticeable clusters of users, it's just hard to see at the moment.

ElaineResende commented 8 years ago

Thank you a lot for your help; I hadn't figured that out. I did 10-fold cross-validation with KFold over all 321 files, calling training and testing, and I saved the plot for each step of the CV. Plots are below.

[images: plots a-j, one per cross-validation fold]

ElaineResende commented 8 years ago

Question: the testing and training sizes are 10% and 90%, respectively. Do you know how I can change their sizes using KFold?

aronwc commented 8 years ago

The size is fine; it's just that you have to loop so that every instance is in the test set once.

n_folds will change the train/test sizes.
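For instance, with the thread's 321 user files, n_folds=10 gives roughly a 90%/10% split and n_folds=5 roughly 80%/20%. A small sketch (the helper name is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

def split_sizes(n_samples, n_folds, seed=0):
    """Return (train, test) sizes of the first fold for a given n_folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    train_idx, test_idx = next(iter(kf.split(np.arange(n_samples))))
    return len(train_idx), len(test_idx)

# 321 files: n_folds=10 -> test fold of ~32 users (~10%),
#            n_folds=5  -> test fold of ~64 users (~20%)
```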


ElaineResende commented 8 years ago

Final plots for the model below are in Plots_model1.

Model characteristics: collapse_mentions=True, collapse_digits=False, binary=False, ngram_range=(1,2), min_df=2, use_idf=True, norm='l2', window_sz=20, gap_sz=100
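The sklearn-side settings above correspond roughly to a TfidfVectorizer configured as below (a sketch; collapse_mentions, collapse_digits, window_sz, and gap_sz are the project's own preprocessing/windowing parameters, not sklearn options):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# binary=False means raw term counts feed the tf-idf weighting
# rather than 0/1 presence indicators.
vec = TfidfVectorizer(ngram_range=(1, 2),  # unigrams + bigrams
                      min_df=2,            # drop terms seen in <2 documents
                      binary=False,
                      use_idf=True,
                      norm='l2')           # L2-normalize each row

X = vec.fit_transform(["protesto na avenida", "marcha na avenida hoje"])
```

With min_df=2, only terms appearing in both toy documents survive ("na", "avenida", "na avenida"), so X has three columns here.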