QAnon tweets: logistic regression model with TFIDF

jwzimmer-zz commented 3 years ago

Following a discussion with Josh Minot: I want to use a logistic regression model to identify words, and potentially n-grams, that differentiate Q and non-Q content on Twitter. The purpose is to support and extend the words found manually.

My noob notes

document embedding/ word embedding
semantic space
bert
word2vec
glove embeddings
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html#examples-using-sklearn-feature-extraction-text-tfidfvectorizer
https://gitlab.com/compstorylab/tweet_utils/-/blob/master/twitter_utils/tweets.py
- read_text()
Downloaded some initial tweets to get started from Josh
- Do not commit tweets to repo
- Do not commit twitter usernames to repo
Tweets take up a lot of space, so I need to work with the compressed file directly
Tweets are surprisingly long JSON objects with a lot of information I can probably ignore
Can save the exact version of the model as a pickled object
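One way to work with the compressed file without unpacking it, sketched with the standard library only. This assumes the tweets are stored as gzip-compressed JSON, one tweet per line, which the notes do not confirm; the function name iter_tweet_fields is hypothetical, and the field names (body, actor.preferredUsername) are the ones used in the code later in this thread. It does not use tweet_utils.

```python
import gzip
import json

def iter_tweet_fields(path, fields=("body",)):
    """Yield a small dict per tweet from a gzip-compressed JSON-lines file.

    Assumption: one JSON object per line. Only the requested top-level
    fields and the nested username are kept; everything else is dropped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            record = {name: tweet.get(name) for name in fields}
            # "actor" may be missing, so fall back to an empty dict.
            record["username"] = (tweet.get("actor") or {}).get("preferredUsername")
            yield record
```

Because it is a generator, only one tweet is held in memory at a time; wrapping the call in list(...) or pandas.DataFrame(...) would materialize the reduced records.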

jwzimmer-zz commented 3 years ago

Started local notebook for this: QTweetsLogisticRegressionTFIDF

jwzimmer-zz commented 3 years ago

Using just a small number of tweets and the body of the tweet rather than the better get_text result.

With Count (not TFIDF):

With TFIDF (not Count):

Here's how I'm finding and labelling "Qanon" and "not-Qanon" tweets:

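import re

train2 = train.copy()  # work on a copy so I don't have to re-run getting train if I mess it up

# Matches the four capitalizations of "qanon" I am looking for
QANON_PATTERN = re.compile(r'qanon|Qanon|QANON|QAnon')

def label_qanon(row):
    # 1 if the tweet body is a string that mentions QAnon, otherwise 0
    body = row['body']
    if isinstance(body, str) and QANON_PATTERN.search(body):
        return 1
    return 0

train2['qanon'] = train2.apply(label_qanon, axis=1)

train2.head()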

jwzimmer-zz commented 3 years ago

Meeting 2/2/21:

"About" qanon: https://twitter.com/i/lists/1350837342622375936
- try as "non-q" class in model
- set up different comparison classes
topic model with sklearn
email chris for vacc access
get set up on vacc - if there is a meeting about how to do this, invite Jane
set up 2 grams
set up 3 grams
check out plotly: https://plotly.com/python/plotly-express/
keep a lot of charts

jwzimmer-zz commented 3 years ago

Using all the tweets from the accounts that have a matching qanon tweet and TFIDF:

Here's the relevant code, showing how I'm finding and labelling the Qanon accounts and their tweets (everything else is a non-Qanon tweet):

import re  # needed for the pattern search below

train3 = train.copy()

def find_qanon_accounts(df):
    # Collect the usernames of accounts with at least one tweet body mentioning QAnon
    qanonaccounts = set()
    for _, row in df.iterrows():
        body = row['body']
        if isinstance(body, str) and re.search(r'qanon|Qanon|QANON|QAnon', body):
            qanonaccounts.add(row['actor']['preferredUsername'])
    return qanonaccounts  # a set, so membership checks below are fast

qanonaccounts = find_qanon_accounts(train3)

def label_qanon(row):
    # 1 for every tweet from a matching account, 0 for everything else
    return int(row['actor']['preferredUsername'] in qanonaccounts)

train3['qanon'] = train3.apply(label_qanon, axis=1)

train3.head()

jwzimmer-zz commented 3 years ago

Using 2grams as features + TFIDF + QAnon accounts:

I get a runtime warning
I get an error with the validation set if I use transform instead of fit_transform

Training Accuracy: 0.999388402757171
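The error message is not shown here, so the following is only a guess at the usual pattern rather than a diagnosis: fit a single vectorizer on the training text, then call transform (not fit_transform) on the validation text with that same fitted object, so both matrices share one vocabulary. Calling fit_transform on the validation text would build a second, different vocabulary. The texts and labels are made up, and the split settings are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Made-up stand-ins for tweet text; 1 = Q account, 0 = other.
texts = [
    "trust the plan", "the storm is coming", "patriots trust the plan",
    "the storm is here patriots", "pasta for dinner", "watching the game",
    "new pasta place", "dinner and a game tonight",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

train_text, val_text, y_train, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_text)  # learns vocabulary and IDF weights
X_val = vectorizer.transform(val_text)          # reuses them; unseen terms are ignored

model = LogisticRegression().fit(X_train, y_train)
train_acc = model.score(X_train, y_train)
val_acc = model.score(X_val, y_val)
```

A training accuracy near 1.0 says little by itself; the validation score from the shared vocabulary is the number to compare across feature sets.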

jwzimmer-zz commented 3 years ago

Added a list of control accounts from https://twitter.com/i/lists/1350837342622375936/members, saved with: write_json(controlaccts,"list_control_Q_accts_02-08-2021.json")

I think this is ready to run and works, except that none of the control accounts happen to be in the tweet sample Josh gave me, so I get an error that there is only 1 class of data (the Q account tweets).
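A small standard-library sketch for catching this before fitting: check how many control accounts actually appear in the sample, and confirm both labels are present. The function names overlap_with_sample and require_two_classes are hypothetical helpers, not part of the notebook.

```python
from collections import Counter

def overlap_with_sample(control_accounts, sample_usernames):
    """Return the control accounts that actually appear in the tweet sample."""
    return set(control_accounts) & set(sample_usernames)

def require_two_classes(labels):
    """Raise a readable error unless both label 0 and label 1 are present."""
    counts = Counter(labels)
    missing = [label for label in (0, 1) if counts[label] == 0]
    if missing:
        raise ValueError(
            f"no examples for class(es) {missing}; counts = {dict(counts)}"
        )
    return counts
```

An empty overlap means the sample needs to be extended with tweets from the control accounts before the Q vs. control comparison can be fit at all.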

jwzimmer-zz / aboutvsof

QAnon tweets: logistic regression model with TFIDF #6