Open jwzimmer-zz opened 3 years ago
Started a local notebook for this: QTweetsLogisticRegressionTFIDF
So far it uses just a small number of tweets, and the body of each tweet rather than the better get_text result.
With Count (not TFIDF):
With TFIDF (not Count):
Here's how I'm finding and labelling "Qanon" and "not-Qanon" tweets:
import re

train2 = train.copy()  # work on a copy so I don't have to re-run getting train if I mess it up

def label_qanon(row):
    # Label a tweet 1 if its body mentions QAnon (in any of the listed casings), else 0.
    body = row['body']
    if isinstance(body, str) and re.search(r'qanon|Qanon|QANON|QAnon', body):
        return 1
    return 0  # non-string bodies (e.g. missing values) count as not-Qanon

train2['qanon'] = train2.apply(label_qanon, axis=1)
train2.head()
Meeting 2/2/21:
Using all the tweets from the accounts that have a matching qanon tweet and TFIDF:
Here's the relevant code, showing how I'm finding and labelling the Qanon accounts and their tweets (everything else is a non-Qanon tweet):
import re

train3 = train.copy()

def find_qanon_accounts(df):
    # Collect the usernames of all accounts with at least one tweet whose body mentions QAnon.
    qanonaccounts = set()
    for _, row in df.iterrows():
        body = row['body']
        if isinstance(body, str) and re.search(r'qanon|Qanon|QANON|QAnon', body):
            qanonaccounts.add(row['actor']['preferredUsername'])
    return list(qanonaccounts)

qanonaccounts = find_qanon_accounts(train3)
qanonaccounts_set = set(qanonaccounts)  # build the lookup set once, not once per row

def label_qanon(row):
    # Every tweet from a Qanon account is labelled 1, everything else 0.
    return 1 if row['actor']['preferredUsername'] in qanonaccounts_set else 0

train3['qanon'] = train3.apply(label_qanon, axis=1)
train3.head()
Using 2grams as features + TFIDF + QAnon accounts:
Training Accuracy: 0.999388402757171
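For reference, a minimal sketch of what "2grams as features" means here: each feature is a pair of adjacent words. The helper name `word_ngrams` and the lowercase-and-split-on-whitespace tokenization are assumptions for illustration; the notebook's actual tokenizer is not shown in this thread.

```python
def word_ngrams(text, n=2):
    """Return the word n-grams of a text (hypothetical helper, whitespace tokens)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text and join each window into one feature.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(word_ngrams("Trust the plan"))  # ['trust the', 'the plan']
```

Texts shorter than n words yield no features, so very short tweets contribute nothing to a pure 2-gram model.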
Add list of control accounts from https://twitter.com/i/lists/1350837342622375936/members:
write_json(controlaccts,"list_control_Q_accts_02-08-2021.json")
I think this is ready to run and works, EXCEPT that none of the control accounts happen to be in the tweet sample Josh gave me, so I get an error that there is only one class of data (the Q account tweets).
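A quick pre-check along these lines might catch the problem before fitting. This is only a sketch: the function names and the toy usernames are hypothetical, and it assumes the control accounts and the sample's usernames are available as plain lists of strings.

```python
from collections import Counter


def control_accounts_in_sample(control_accounts, sample_usernames):
    """Return the control accounts that actually appear in the tweet sample."""
    return sorted(set(control_accounts) & set(sample_usernames))


def check_two_classes(labels):
    """Raise early if the labels contain fewer than two classes."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError(f"need at least 2 classes to fit a classifier, found {dict(counts)}")
    return counts


# Toy values, made up for illustration only.
print(control_accounts_in_sample(["ctrl_a", "ctrl_b"], ["q_user", "ctrl_b", "q_user"]))
print(check_two_classes([1, 0, 1, 1]))
```

If the first call returns an empty list, the sample needs to be extended with tweets from the control accounts before the model can be trained.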
Per discussion with Josh Minot: we want to use a logistic regression model to identify words, and potentially n-grams, that can be used to differentiate Q and non-Q content on Twitter. The purpose is to support and enhance the list of words found manually.
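A rough sketch of that idea, assuming scikit-learn: fit a logistic regression on TF-IDF features over words and 2-grams, then rank the features by coefficient. The toy tweets and labels below are made up purely so the example runs; they are not from the real sample, and the notebook's actual settings may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy tweets (1 = Q, 0 = non-Q), for illustration only.
texts = [
    "trust the plan qanon",
    "qanon says the storm is coming",
    "the storm is coming trust the plan",
    "nice weather for football today",
    "football game tonight with friends",
    "weather is nice today",
]
labels = [1, 1, 1, 0, 0, 0]

# Single words and 2-grams, weighted by TF-IDF.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)

model = LogisticRegression()
model.fit(X, labels)

# Map each feature to its coefficient: positive leans Q, negative leans non-Q.
weights = {term: model.coef_[0][idx] for term, idx in vectorizer.vocabulary_.items()}
ranked = sorted(weights, key=weights.get, reverse=True)

print("most Q-like features:", ranked[:5])
print("most non-Q-like features:", ranked[-5:])
```

The highest-ranked features are candidates to compare against the manually found words. Since labels are assigned by matching "qanon" in the text, that term will trivially rank high, so the more interesting candidates are the other terms near the top.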
My noob notes