jwzimmer-zz / aboutvsof

QAnon tweets: logistic regression model with TFIDF #6

Open jwzimmer-zz opened 3 years ago

jwzimmer-zz commented 3 years ago

Following a discussion with Josh Minot: I want to use a logistic regression model to identify words (and potentially n-grams) that differentiate Q from non-Q content on Twitter. The purpose is to support and extend the list of words found manually.

My noob notes

jwzimmer-zz commented 3 years ago

Started local notebook for this: QTweetsLogisticRegressionTFIDF

jwzimmer-zz commented 3 years ago

Using just a small number of tweets, and the tweet's body field rather than the (better) get_text result.

With Count (not TFIDF):

Screen Shot 2021-01-31 at 5 29 06 PM

With TFIDF (not Count):

Screen Shot 2021-01-31 at 5 31 54 PM

Here's how I'm finding and labelling "Qanon" and "not-Qanon" tweets:

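import re

train2 = train.copy()  # work on a copy so `train` doesn't have to be re-fetched if this goes wrong

# Matches four specific capitalizations only; this is not a case-insensitive match.
QANON_PATTERN = re.compile(r'qanon|Qanon|QANON|QAnon')

def label_qanon(row):
    """Return 1 if the tweet body mentions QAnon, else 0 (also 0 for non-string bodies)."""
    body = row['body']
    if isinstance(body, str) and QANON_PATTERN.search(body):
        return 1
    return 0

train2['qanon'] = train2.apply(label_qanon, axis=1)

train2.head()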
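
A quick check of what the pattern above does and does not catch. The sample strings are made up, and the re.IGNORECASE variant is only a possible alternative, not something the notebook uses:

```python
import re

# The pattern from the labelling code: four explicit capitalizations.
explicit = re.compile(r'qanon|Qanon|QANON|QAnon')
# Hypothetical alternative: match any capitalization.
any_case = re.compile(r'qanon', re.IGNORECASE)

samples = [
    "#QAnon is trending",
    "what is qanon?",
    "qAnon rally",        # mixed case that the explicit pattern does not list
    "Qanons everywhere",  # substring match, so this counts as a hit
    "nothing to see",
]

for text in samples:
    print(f"{text!r}: explicit={bool(explicit.search(text))}, "
          f"any_case={bool(any_case.search(text))}")
```

Both patterns use search, so they match anywhere in the body, including inside hashtags and longer words.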
jwzimmer-zz commented 3 years ago

Meeting 2/2/21:

jwzimmer-zz commented 3 years ago

Using all the tweets from the accounts that have a matching qanon tweet and TFIDF:

Screen Shot 2021-02-01 at 7 05 37 PM

Here's the relevant code: how I'm finding and labelling the Qanon accounts and their tweets (everything else counts as a non-Qanon tweet):

import re

train3 = train.copy()

# Matches four specific capitalizations only; this is not a case-insensitive match.
QANON_PATTERN = re.compile(r'qanon|Qanon|QANON|QAnon')

def find_qanon_accounts(df):
    """Return the usernames that have at least one tweet whose body matches QANON_PATTERN."""
    accounts = set()
    for _, row in df.iterrows():
        body = row['body']
        if isinstance(body, str) and QANON_PATTERN.search(body):
            accounts.add(row['actor']['preferredUsername'])
    return list(accounts)

qanonaccounts = find_qanon_accounts(train3)
qanon_account_set = set(qanonaccounts)  # build once instead of once per row

def label_qanon(row):
    """Label every tweet from a matching account as 1, everything else as 0."""
    return 1 if row['actor']['preferredUsername'] in qanon_account_set else 0

train3['qanon'] = train3.apply(label_qanon, axis=1)

train3.head()
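
For reference, the same account-level labelling can be sketched without iterrows. This assumes pandas and uses a made-up four-tweet frame with the same body and actor['preferredUsername'] layout as the code above. Note that every tweet from a matching account gets label 1, including tweets that never mention QAnon:

```python
import pandas as pd

# Toy frame mimicking the tweet layout used above; not real data.
tweets = pd.DataFrame({
    "body": ["#QAnon update", "lunch was good", "nice weather", None],
    "actor": [
        {"preferredUsername": "user_a"},
        {"preferredUsername": "user_a"},
        {"preferredUsername": "user_b"},
        {"preferredUsername": "user_b"},
    ],
})

# Pull the username out of the nested actor dict once.
usernames = tweets["actor"].map(lambda actor: actor["preferredUsername"])

# na=False treats missing bodies as non-matching.
mentions_q = tweets["body"].str.contains(r"qanon|Qanon|QANON|QAnon", regex=True, na=False)

qanon_accounts = set(usernames[mentions_q])
tweets["qanon"] = usernames.isin(qanon_accounts).astype(int)
print(tweets[["body", "qanon"]])
```

Here user_a's "lunch was good" tweet is labelled 1 because the account matched elsewhere, which is the intended difference from the per-tweet labelling.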
jwzimmer-zz commented 3 years ago

Using 2grams as features + TFIDF + QAnon accounts:

Training Accuracy: 0.999388402757171

Screen Shot 2021-02-08 at 1 12 58 PM Screen Shot 2021-02-08 at 1 13 14 PM
jwzimmer-zz commented 3 years ago
Screen Shot 2021-02-08 at 1 25 58 PM
jwzimmer-zz commented 3 years ago

Add list of control accounts from https://twitter.com/i/lists/1350837342622375936/members: write_json(controlaccts,"list_control_Q_accts_02-08-2021.json")

Screen Shot 2021-02-08 at 3 04 24 PM

I think this works and is ready to run, EXCEPT that none of the control accounts happen to be in the tweet sample Josh gave me, so I get an error that there is only 1 class of data (the Q account tweets).
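
One way to catch this before fitting is to count how many tweets in the sample fall into each class. A small sketch using only the standard library; the function name, usernames and account sets are hypothetical placeholders, not the real lists:

```python
from collections import Counter

def class_counts(usernames, q_accounts, control_accounts):
    """Count tweets per class; an account in both sets is counted as Q."""
    counts = Counter()
    for name in usernames:
        if name in q_accounts:
            counts["q"] += 1
        elif name in control_accounts:
            counts["control"] += 1
    return counts

# Hypothetical example: one username per tweet in the sample.
sample_usernames = ["user_a", "user_a", "user_b"]
q_accounts = {"user_a", "user_b"}
control_accounts = {"user_c"}

counts = class_counts(sample_usernames, q_accounts, control_accounts)
missing = [label for label in ("q", "control") if counts[label] == 0]

if missing:
    print("No tweets in the sample for:", ", ".join(missing))
else:
    print("Both classes present:", dict(counts))
```

If a class is missing, the model cannot be fit on that sample until tweets from that class are added.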