Machine learning model for bot detection

jakartaresearch / adi-buzzer

Analyzing and Detecting Indonesia Buzzer in Twitter About Politics and Social Issues


Machine learning model for bot detection #8

Open andreaschandra opened 4 years ago

andreaschandra commented 4 years ago

Given a topic or hashtag, we want to see whether its tweets are more likely to be flooded by buzzers or to come from organic users.

Given a buzzer account, we want to see the major topics it buzzes about.

This task includes

feature engineering (need to do text cleansing, preprocessing)
baseline model
early fine-tuning
evaluation
[x] define feature set
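
The text cleansing step listed above could look roughly like the following. This is a minimal sketch, not the notebook's actual preprocessing: the function name `clean_tweet` and the specific rules (lowercasing, dropping URLs and @mentions, keeping hashtags) are assumptions chosen for illustration.

```python
import re

# Hypothetical cleaning rules; the project's real rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_WORD_RE = re.compile(r"[^\w\s#]")  # keep '#' so hashtags survive
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, mentions and punctuation."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions above.
    return SPACE_RE.sub(" ", text).strip()
```

Hashtags are kept on purpose here, since the proposed feature set counts them separately.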

rubentea16 commented 4 years ago

Prepare Social Politics Word Dictionary (SPWD)

Proposed feature set:

username
name
is_name_social_political
desc
tweets
n_tweet
quoted_tweets
hashtag
n_tweet_use_hashtag
ratio_tweets_use_hashtag
n_photo
n_video
content_url

Feature engineering:

[x] is_name_social_political (1/0) <- create model
[x] n_tweet
[x] hashtag
[x] n_tweet_use_hashtag
[x] ratio_tweets_use_hashtag
[x] n_photo
[x] n_video
[x] content_url
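
As a rough illustration of the count-based features in the checklist above, the sketch below derives `n_tweet`, `hashtag`, `n_tweet_use_hashtag` and `ratio_tweets_use_hashtag` from one account's tweets. The helper name `hashtag_features` is hypothetical, and defining the ratio as `n_tweet_use_hashtag / n_tweet` is an assumption, since the thread does not spell the definition out.

```python
import re
from typing import Dict, List

HASHTAG_RE = re.compile(r"#\w+")


def hashtag_features(tweets: List[str]) -> Dict[str, object]:
    """Compute per-account hashtag features from a list of tweet texts."""
    n_tweet = len(tweets)
    # Distinct hashtags used by the account, lowercased and sorted.
    hashtags = sorted({h.lower() for t in tweets for h in HASHTAG_RE.findall(t)})
    # Number of tweets that contain at least one hashtag.
    n_tweet_use_hashtag = sum(1 for t in tweets if HASHTAG_RE.search(t))
    # Assumed definition: share of the account's tweets that use a hashtag.
    ratio = n_tweet_use_hashtag / n_tweet if n_tweet else 0.0
    return {
        "n_tweet": n_tweet,
        "hashtag": hashtags,
        "n_tweet_use_hashtag": n_tweet_use_hashtag,
        "ratio_tweets_use_hashtag": ratio,
    }
```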

andreaschandra commented 3 years ago

@rubentea16 if we have tried a variety of techniques and the score is still poor, maybe the labeling is not consistent enough or there are not enough labels

andreaschandra commented 3 years ago

Baseline model result @rubentea16

BernoulliNB
accuracy: 0.78 | precision: 0.60 | recall: 0.21 | f score: 0.32

Linear SVM
accuracy: 0.85 | precision: 0.74 | recall: 0.57 | f score: 0.64

Random Forest
accuracy: 0.82 | precision: 0.74 | recall: 0.43 | f score: 0.54

Gradient Boosting
accuracy: 0.84 | precision: 0.73 | recall: 0.55 | f score: 0.63

AdaBoost
accuracy: 0.81 | precision: 0.63 | recall: 0.58 | f score: 0.60
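
For reading the numbers above, it helps to recall how the F score relates to precision and recall. The sketch below assumes the reported "f score" is the standard F1 (the harmonic mean of precision and recall); the thread does not say so explicitly. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit.

```python
def f_score(precision: float, recall: float) -> float:
    """F1: harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Linear SVM row: precision 0.74, recall 0.57
linear_svm_f1 = round(f_score(0.74, 0.57), 2)
```

With recall this far below precision for most models, the F score is pulled toward the recall value, which is why accuracy alone looks more favourable than the F score.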

rubentea16 commented 3 years ago

Which features does this use?

andreaschandra commented 3 years ago

@rubentea16 just the tweets, see https://github.com/jakartaresearch/adi-buzzer/blob/dev/notebook/40_buzzer_classifier.ipynb

rubentea16 commented 3 years ago

Performance Benchmark

Notes :

multiple_feat = tweets, user_desc, is_name_social_political, ratio_tweets_use_hashtag, n_tweet, n_photo, n_video
single_feat = tweets
RFC = Random Forest Classifier (n_estimators=400)

Model	Description	Features	Text representation	Accuracy	Precision	Recall	F1-score
RFC	-	multiple_feat	TF-IDF	0.84	0.75	0.33	0.45
RFC	-	single_feat	TF-IDF	0.84	0.72	0.35	0.47
SMOTE+RFC	Oversampling train data (minority class)	multiple_feat	TF-IDF (desc = 3K dim, tweet = 50K dim)	0.86	0.66	0.62	0.64
SMOTE+RFC	Oversampling train data (minority class)	single_feat	BPE (tweet = 300 dim)	0.86	0.68	0.57	0.62
SMOTE+SVC (default)	Oversampling train data (minority class)	single_feat	BPE (tweet = 300 dim)	0.84	0.59	0.73	0.65
SMOTE+XGBoost (default)	Oversampling train data (minority class)	single_feat	BPE (tweet = 300 dim)	0.86	0.66	0.62	0.64
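
For readers unfamiliar with SMOTE, the core idea is to create synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below is a simplified pure-Python illustration of that idea, not the library implementation used for the benchmark; the function name `smote_like_oversample` and its parameters are hypothetical. As in the table, it would be applied to the training split only.

```python
import random
from typing import List, Sequence


def _squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_like_oversample(
    minority: List[Sequence[float]], n_new: int, k: int = 5, seed: int = 0
) -> List[List[float]]:
    """Generate n_new synthetic points from minority-class feature vectors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of base (excluding itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: _squared_distance(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Pick a random point on the segment between base and the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Oversampling only the training data keeps the test set at its natural class balance, so the reported recall gains are not an artifact of synthetic test samples.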

andreaschandra commented 3 years ago

0.64

interesting

andreaschandra commented 3 years ago

Result after QA label

Algo	Accuracy	Precision	Recall	F-score
Bernoulli NB	0.78	0.75	0.21	0.33
SVM	0.85	0.75	0.60	0.67
Random Forest	0.81	0.77	0.34	0.47
Gradient Boosting	0.84	0.78	0.53	0.63
AdaBoost	0.82	0.67	0.56	0.61

andreaschandra commented 3 years ago

Algo	Accuracy	Precision	Recall	F-score
Bernoulli NB	0.82	0.54	0.69	0.61
SVM	0.87	0.69	0.65	0.67
RF	0.85	0.74	0.43	0.54
Gradient Boosting	0.87	0.72	0.54	0.62
AdaBoost	0.84	0.60	0.56	0.58