Implement the well-being classifier

vlandeiro commented 10 years ago

https://github.com/virgile11/sporty-twitters/issues?milestone=6

aronwc commented 10 years ago

Sample ~1K tweets containing any of the anxiety/depression keywords
Label as pos/neg.
Train a classifier
Report cross-validation accuracy.
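
A rough sketch of the sampling step, assuming tweets are plain strings. The keyword set and the `sample_for_labeling` helper are hypothetical; the real anxiety/depression keyword list is not given here.

```python
import random

# Hypothetical keywords, standing in for the real anxiety/depression list.
KEYWORDS = {"anxious", "depressed", "hopeless"}

def contains_keyword(text, keywords=KEYWORDS):
    """Return True if any keyword appears as a whole token in the tweet."""
    tokens = text.lower().split()
    return any(tok.strip(".,!?#") in keywords for tok in tokens)

def sample_for_labeling(tweets, k=1000, seed=0):
    """Randomly pick up to k keyword-matching tweets to label by hand."""
    matches = [t for t in tweets if contains_keyword(t)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(matches, min(k, len(matches)))

tweets = [
    "Feeling so anxious before the race",
    "Great 5k run this morning!",
    "I feel hopeless today.",
]
to_label = sample_for_labeling(tweets, k=2)
```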

vlandeiro commented 10 years ago

Each row of the feature matrix holds the features for one tweet.

Open questions:
- Should I use stemming?
- How do I weight the features: tf or tf-idf?
- What kind of feature filtering: top information-gain words?
- Evaluation: F1

list of possible features to discuss:

tweet features:
- unigrams
- length of the tweet (small, medium, long)
- use hashtags (yes/no)
- is a retweet (yes/no)
- mentions users (yes/no)
user features:
- favorites
- followers
- followees
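
To make the tweet-level features above concrete, here is a possible encoding as one dict per tweet. The function name and the length thresholds are placeholders, not values agreed on in this thread; the user features would come from the user object and are left out.

```python
def tweet_features(text, short_max=50, medium_max=100):
    """Build the non-word features for one tweet as a dict.

    short_max and medium_max are arbitrary placeholder thresholds.
    """
    tokens = text.split()
    n = len(text)
    if n <= short_max:
        length = "small"
    elif n <= medium_max:
        length = "medium"
    else:
        length = "long"
    return {
        "length=" + length: 1,
        "has_hashtag": int(any(t.startswith("#") for t in tokens)),
        "is_retweet": int(text.startswith("RT @")),
        "has_mention": int(any(t.startswith("@") and len(t) > 1 for t in tokens)),
    }

feats = tweet_features("RT @runner: new personal best #running")
```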

Just an idea: we have access to all the colors a user sets for their profile, so why not use the average luminosity of those colors as a feature and see whether it helps classify depressed users?
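
One way the luminosity feature could be computed, assuming the profile colors come as `RRGGBB` hex strings. The weights are the common Rec. 601 luma approximation without gamma correction, and the example colors are made up.

```python
def luminosity(hex_color):
    """Approximate perceived brightness of an 'RRGGBB' color, in [0, 1]."""
    hex_color = hex_color.lstrip("#")
    r, g, b = (int(hex_color[i:i + 2], 16) / 255 for i in (0, 2, 4))
    # Rec. 601 luma weights; gamma correction is deliberately ignored.
    return 0.299 * r + 0.587 * g + 0.114 * b

def average_luminosity(colors):
    """Mean luminosity over a user's profile colors."""
    return sum(luminosity(c) for c in colors) / len(colors)

# Hypothetical profile colors (background, text, link) for one user.
profile_colors = ["000000", "FFFFFF", "0084B4"]
avg = average_luminosity(profile_colors)
```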

aronwc commented 10 years ago

tokenization:

https://github.com/aronwc/twutil/blob/master/twutil/preprocess.py
- collapses mentions and url tokens
- maintains (some) emoticons
skip stemming
start with unigrams
ignore tf-idf for now
add flag to allow ignoring retweets
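
For illustration only, a simplified tokenizer with the same behavior as described above: mentions and URLs collapse to placeholder tokens, a few emoticons survive, and a flag skips retweets. This is not the twutil code; the regexes and placeholder names are assumptions.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# A few emoticons such as :) ;-( =d, or an optional '#' followed by a word.
TOKEN_RE = re.compile(r"[:;=]-?[()dp]|#?\w+(?:'\w+)?")

def tokenize(text, keep_retweets=True):
    """Lowercase, collapse URLs and mentions, and split into unigrams.

    Returns None for retweets when keep_retweets is False.
    """
    if not keep_retweets and text.startswith("RT "):
        return None
    text = text.lower()
    text = URL_RE.sub(" URL ", text)          # every URL becomes one token
    text = MENTION_RE.sub(" MENTION ", text)  # every @user becomes one token
    return TOKEN_RE.findall(text)

tokens = tokenize("@bob I'm so tired :( http://t.co/abc")
```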

If we have users before/after they started using the app, we can compute engagement features like:

tweet frequency
retweet frequency
mention frequency

We'll want to scale X matrix
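
A small sketch of both points, assuming each user is a list of tweet texts posted over a known number of days. The helper names are hypothetical, and standardization (zero mean, unit variance per column) is only one of several ways to scale X.

```python
from statistics import mean, pstdev

def engagement_features(tweets, days):
    """Per-day tweet, retweet, and mention rates for one user."""
    retweets = sum(t.startswith("RT @") for t in tweets)
    # Count mentions only in tweets that are not retweets.
    mentions = sum("@" in t and not t.startswith("RT @") for t in tweets)
    return [len(tweets) / days, retweets / days, mentions / days]

def standardize(X):
    """Scale each column of X to zero mean and unit variance."""
    columns = list(zip(*X))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) or 1.0 for c in columns]  # constant column: divide by 1
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

X = [
    engagement_features(["RT @a: hi", "@b thanks", "ran 5k"], days=1),
    engagement_features(["ran 10k"], days=1),
]
X_scaled = standardize(X)
```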

aronwc commented 10 years ago

tokenizer done.
CountVectorizer in progress (just adding the binary non-word features)

aronwc commented 10 years ago

~75% accuracy on 250 labeled tweets
We discussed labeling TA/DD/AH separately.
- expand AH
- add to twitter tracking
Remove features occurring once in labeled data.
Print out top features per class
Sample Twitter users who
- use an exercise app
- have more than N tweets not from an exercise app
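
The two feature-related items above (dropping features seen only once, printing top features per class) could look roughly like this on tokenized tweets. Note the ranking here is by raw per-class document counts, a stand-in for classifier weights; function names and data are illustrative.

```python
from collections import Counter

def prune_rare_features(docs, min_count=2):
    """Drop tokens that appear in fewer than min_count labeled tweets."""
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return [[tok for tok in doc if doc_freq[tok] >= min_count] for doc in docs]

def top_features_per_class(docs, labels, n=2):
    """Rank tokens by how many tweets of each class contain them."""
    counts = {}
    for doc, label in zip(docs, labels):
        counts.setdefault(label, Counter()).update(set(doc))
    return {label: [tok for tok, _ in c.most_common(n)]
            for label, c in counts.items()}

docs = [["so", "sad", "today"], ["so", "sad"], ["happy", "run"], ["happy", "today"]]
labels = ["pos", "pos", "neg", "neg"]

pruned = prune_rare_features(docs)  # "run" occurs once and is removed
top = top_features_per_class(pruned, labels)
```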

vlandeiro commented 10 years ago

Replaced DictVectorizer with TfidfVectorizer in order to remove features
Labeling of TA/DD/AH:
- expanded TA and AH
- collected tweets with TA/DD/AH keywords
- started labeling tweets
Added multilabel classification
Added functionality to get the top features of a classifier
Rewrote the code into an API instead of CLI

vlandeiro commented 10 years ago

Done:

2000+ tweets labeled on AH, DD, and TA dimensions
added bigrams to features
added emoticons to features
added collection of tweets and friend lists for a given user via the CLI
collected a list of ~68K unique users that have used a sports app
started collecting tweets and list of friends for each of these 68K users
document the code and complete the README
create the setup.py
solve problem when adding bigrams to features: #16

To do:

continue labeling to reach 3K labeled tweets
start reading papers about classifying the gender/age of users

aronwc commented 10 years ago

print out some errors to brainstorm better features
report F1/precision/recall for the positive class.
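
For reference, the positive-class scores can be computed directly from the true and predicted labels. A minimal sketch, assuming binary labels with 1 as the positive class:

```python
def positive_class_scores(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class only."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
precision, recall, f1 = positive_class_scores(y_true, y_pred)
```

With accuracy alone, a classifier that never predicts the positive class can still look good on imbalanced data, which is why these three numbers are worth reporting.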

vlandeiro commented 10 years ago

Benchmark and CLI modified to:

print N misclassified tweets
report scores for the positive class
choose the classifier amongst SVM, Logistic Regression, K-Nearest Neighbors, Decision Tree, and Naive Bayes

vlandeiro commented 10 years ago

Done:

print confusion matrix
print the weight for each feature of a misclassified tweet
try a single-label classifier: new label = (AH or DD or TA)
do feature selection
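
As an illustration of the single-label idea and the confusion matrix, assuming each dimension is a list of 0/1 labels (the data and helper names below are made up):

```python
def combine_labels(ah, dd, ta):
    """New label is 1 when a tweet is positive on AH or DD or TA."""
    return [int(a or d or t) for a, d, t in zip(ah, dd, ta)]

def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary 0/1 labels."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1  # row = true label, column = predicted label
    return matrix

ah = [1, 0, 0, 0]
dd = [0, 0, 1, 0]
ta = [0, 0, 1, 0]
y_true = combine_labels(ah, dd, ta)
y_pred = [1, 1, 0, 0]
cm = confusion_matrix(y_true, y_pred)
```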

vlandeiro commented 10 years ago

When running a benchmark, I was only returning the first ROC_AUC score. Now I return the average of all ROC_AUC scores (averaged over AH, DD, and TA for the given parameters).
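
A sketch of the averaging, assuming one list of true labels and one list of classifier scores per dimension. The AUC here is computed as the fraction of (positive, negative) pairs ranked correctly; the scores are invented for the example, and this is not the benchmark code itself.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_roc_auc(y_true_by_label, scores_by_label):
    """Average the per-label AUCs instead of returning only the first one."""
    aucs = [roc_auc(y_true_by_label[label], scores_by_label[label])
            for label in y_true_by_label]
    return sum(aucs) / len(aucs)

y_true_by_label = {"AH": [1, 0, 1, 0], "DD": [1, 0, 1, 0], "TA": [0, 1, 0, 1]}
scores_by_label = {
    "AH": [0.9, 0.1, 0.8, 0.3],
    "DD": [0.9, 0.8, 0.2, 0.1],
    "TA": [0.6, 0.4, 0.3, 0.7],
}
average_auc = mean_roc_auc(y_true_by_label, scores_by_label)
```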

Run test script improved:

tree structure StatsTree to make it easily changeable
one node StatsNode for one parameter of the command line
each node has:
- the parameter name (str)
- a dictionary that matches every option for this parameter to the vector to add to the argvector for this option (dict)
- a dictionary that matches every option for this parameter to the following node (str) in the tree for this option OR a string if every option leads to the same node OR None if there is no following node.
when the tree is built, we can run a depth-first traversal using StatsTree.traverse(func):
- the func parameter is a function taking one parameter that is executed in each leaf of the tree.
- the parameter cmd passed to func is the arg vector built up along the path to that leaf.
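
To show how the pieces described above fit together, here is a minimal reconstruction from the description alone. It is not the repository code: the constructor arguments, the `following` helper, and the command-line flags in the example are all assumptions.

```python
class StatsNode:
    def __init__(self, name, options, next_node=None):
        self.name = name            # parameter name (str)
        self.options = options      # option -> args to append to the arg vector
        self.next_node = next_node  # dict option -> node name, a node name, or None

    def following(self, option):
        """Name of the node that follows this option, or None at a leaf."""
        if isinstance(self.next_node, dict):
            return self.next_node.get(option)
        return self.next_node

class StatsTree:
    def __init__(self, root, nodes):
        self.root = root                         # name of the first parameter
        self.nodes = {n.name: n for n in nodes}  # look nodes up by name

    def traverse(self, func, name=None, cmd=None):
        """Depth-first traversal; func(cmd) is called once per leaf."""
        name = self.root if name is None else name
        cmd = [] if cmd is None else cmd
        node = self.nodes[name]
        for option, args in node.options.items():
            new_cmd = cmd + args
            nxt = node.following(option)
            if nxt is None:
                func(new_cmd)
            else:
                self.traverse(func, nxt, new_cmd)

# Hypothetical parameters: only the SVM branch asks for a kernel.
clf = StatsNode("clf",
                {"svm": ["--clf", "svm"], "nb": ["--clf", "nb"]},
                {"svm": "kernel", "nb": None})
kernel = StatsNode("kernel",
                   {"linear": ["--kernel", "linear"], "rbf": ["--kernel", "rbf"]})
tree = StatsTree("clf", [clf, kernel])

commands = []
tree.traverse(commands.append)
```

Each entry of `commands` is one complete arg vector, so the function passed to `traverse` could just as well run the benchmark for that combination.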

vlandeiro / sporty-twitters

Implement the well-being classifier #10