tapilab / protest

analyze brazilian protests on Twitter

Create plot showing probability(protest) over time #18

Open aronwc opened 8 years ago

ElaineResende commented 8 years ago

The testing set was created with at most 1500 tweets per user, because if I fetch more tweets I get a memory error. The plot of probabilities is shown below. Is this what you had in mind?

[image: plot of probabilities]

aronwc commented 8 years ago

yes, neat! I wonder why the probabilities plateau so quickly (I'd expect to still see more for x > 400). Are we using binary or tf-idf term vectors?

We also need a way to adjust for the different number of tweets per user. Perhaps the x axis becomes % of total tweets?
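One way to put users with different tweet counts on a common x axis is to rescale each user's tweet index to a percentage of that user's total. A minimal sketch (the function name and inputs are illustrative, not the notebook's actual code):

```python
import numpy as np

def to_percent_axis(probs):
    """Map tweet positions to % of the user's total tweets, so users
    with different tweet counts share one x axis (0-100%)."""
    n = len(probs)
    return 100.0 * np.arange(1, n + 1) / n  # last tweet lands at 100%

# e.g. a user with 4 scored tweets
x = to_percent_axis([0.1, 0.4, 0.6, 0.9])
```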

ElaineResende commented 8 years ago

We are using tf-idf. Sure, I will change the x axis.

ElaineResende commented 8 years ago

Well, now I have this plot; does it look like what you were expecting?

[image: plot with x axis as % of total tweets]

aronwc commented 8 years ago

Not sure what I was expecting, but this looks interesting.

We're looking for two things:

ElaineResende commented 8 years ago

These are the points I took from our discussion last week. I have probably done something wrong, because there are no spikes or groups; or maybe the classifier did not help. The probabilities I printed are very similar. But I am going to take a further look at it; maybe I missed something.

Our testing set is: For each user U:

aronwc commented 8 years ago

I'm looking at the Classification-testing notebook now. It looks right. I would probably change this part:

            if i > negative_window and matches_keywords(js['text'], keywords):
                print(user)
                var=''
                for l in lines[-100:][::-1]:  # just look at the most recent 100 tweets. 

I agree that running for all 3200 tweets is too much. It looks like you were using the first 1500 tweets, instead of the last 1500 tweets.

The code above just uses the most recent 100 tweets, which should be enough.

Another thing: It looks like you train on a random sample of users then test on the remaining. I'd recommend doing this multiple times to get predictions for all users. One way to do this is to read in all the users at training time. Then, do a cross-validation loop, training on X% of the users, then reading in the testing users and predicting on them.
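The suggested protocol could be sketched as follows: split the *users* into K folds, train on the other folds, and predict on the held-out users, so that every user receives a prediction exactly once. This is a sketch with illustrative names (`train_fn`, `predict_fn` stand in for the notebook's actual training/testing routines); it uses the modern `sklearn.model_selection.KFold` API (older releases had `sklearn.cross_validation.KFold` with an `n_folds` argument):

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_val_user_probs(users, train_fn, predict_fn, n_folds=10, seed=42):
    """Train on (K-1)/K of the users, predict on the held-out 1/K,
    and repeat until every user has been in the test set once."""
    users = np.asarray(users)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    probs = {}
    for train_idx, test_idx in kf.split(users):
        model = train_fn(users[train_idx])   # fit on the training users
        for u in users[test_idx]:            # score each held-out user
            probs[u] = predict_fn(model, u)
    return probs  # one prediction per user, each produced out-of-fold
```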

I do think there are noticeable clusters of users, it's just hard to see at the moment.

ElaineResende commented 8 years ago

Thank you a lot for your help; I hadn't figured that out. I did 10-fold cross-validation with KFold over all 321 files, calling training and testing, and I saved the plot for each step of the CV. Plots are below.

[images: plots a-j, one per cross-validation fold]

ElaineResende commented 8 years ago

Question: the testing and training sizes are 10% and 90%, respectively. Do you know how I can change their sizes using KFold?

aronwc commented 8 years ago

The size is fine; it's just that you have to loop so that every instance is in the test set once.

n_folds will change the train/test sizes.
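For instance, with the thread's 321 user files, n_folds=10 gives roughly a 90%/10% split and n_folds=5 roughly 80%/20%. A small sketch (the helper name is illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold

def split_sizes(n_samples, n_folds, seed=0):
    """Return (train, test) sizes of the first fold for a given n_folds."""
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    train_idx, test_idx = next(iter(kf.split(np.arange(n_samples))))
    return len(train_idx), len(test_idx)

# 321 files: n_folds=10 -> test fold of ~32 users (~10%),
#            n_folds=5  -> test fold of ~64 users (~20%)
```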


ElaineResende commented 8 years ago

Final plots for the model below are in Plots_model1.

Model characteristics: collapse_mentions=True, collapse_digits=False, binary=False, ngram_range=(1,2), min_df=2, use_idf=True, norm='l2', window_sz=20, gap_sz=100
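The sklearn-side settings above correspond roughly to a TfidfVectorizer configured as below (a sketch; collapse_mentions, collapse_digits, window_sz, and gap_sz are the project's own preprocessing/windowing parameters, not sklearn options):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# binary=False means raw term counts feed the tf-idf weighting
# rather than 0/1 presence indicators.
vec = TfidfVectorizer(ngram_range=(1, 2),  # unigrams + bigrams
                      min_df=2,            # drop terms seen in <2 documents
                      binary=False,
                      use_idf=True,
                      norm='l2')           # L2-normalize each row

X = vec.fit_transform(["protesto na avenida", "marcha na avenida hoje"])
```

With min_df=2, only terms appearing in both toy documents survive ("na", "avenida", "na avenida"), so X has three columns here.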