jakartaresearch / adi-buzzer

Analyzing and Detecting Indonesia Buzzer in Twitter About Politics and Social Issues

Machine learning model for bot detection #8

Open andreaschandra opened 4 years ago

andreaschandra commented 4 years ago

Given a topic or hashtag, we want to see whether its tweet population is more likely to be flooded by buzzers or to come from organic users

or

given a buzzer account, we want to see the major topics it is buzzing about
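
Both questions reduce to simple aggregations once a per-tweet buzzer classifier exists. A minimal sketch, assuming each tweet is a dict with `text` and `hashtags` keys and that `is_buzzer` is any callable returning True/False; the function names, the toy classifier, and the sample tweets are all made up for illustration and are not from the repo:

```python
from collections import Counter, defaultdict


def buzzer_share_by_topic(tweets, is_buzzer):
    """Return {hashtag: fraction of its tweets flagged as buzzer}."""
    counts = defaultdict(lambda: [0, 0])  # hashtag -> [flagged tweets, all tweets]
    for tweet in tweets:
        flagged = bool(is_buzzer(tweet["text"]))
        for tag in tweet["hashtags"]:
            counts[tag][0] += flagged
            counts[tag][1] += 1
    return {tag: flagged / total for tag, (flagged, total) in counts.items()}


def top_topics(account_tweets, k=3):
    """Return the k hashtags one account uses most often."""
    counts = Counter(tag for t in account_tweets for tag in t["hashtags"])
    return [tag for tag, _ in counts.most_common(k)]


# Hypothetical stand-in for a trained model, only so the example runs.
def toy_is_buzzer(text):
    return "!!!" in text


# Invented sample data.
tweets = [
    {"text": "Vote now!!!", "hashtags": ["#pilkada"]},
    {"text": "Reading the debate transcript", "hashtags": ["#pilkada"]},
    {"text": "Share this!!!", "hashtags": ["#pilkada", "#banjir"]},
    {"text": "Traffic is bad today", "hashtags": ["#banjir"]},
]

shares = buzzer_share_by_topic(tweets, toy_is_buzzer)
print(shares)                 # share of flagged tweets per hashtag
print(top_topics(tweets, 1))  # most frequent hashtag in this tweet list
```

A share close to 1.0 would suggest the hashtag is mostly buzzer-driven, but where to put the cutoff is a judgment call that this sketch does not settle.
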

This task includes

rubentea16 commented 4 years ago

Prepare Social Politics Word Dictionary (SPWD)

Propose Feature Set :

Feature Engineering :

andreaschandra commented 3 years ago

@rubentea16 if we have tried a variety of techniques and the score is still poor, the labeling may be inconsistent or we may not have enough labeled data
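
One way to put a number on labeling consistency is Cohen's kappa. This is only a sketch and it assumes a sample of tweets was labeled independently by two annotators, which the thread does not say was done; the label lists below are invented:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Agreement between two annotators, corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected if each annotator labeled at random
    # with their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels: 1 = buzzer, 0 = organic.
annotator_1 = [1, 1, 0, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 0]

kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

A low kappa would point at the labeling guidelines rather than the models; a high kappa with poor scores would point more toward too little data or weak features.
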

andreaschandra commented 3 years ago

Baseline model results @rubentea16

BernoulliNB
accuracy: 0.78 | precision: 0.60 | recall: 0.21 | f score: 0.32

Linear SVM
accuracy: 0.85 | precision: 0.74 | recall: 0.57 | f score: 0.64

Random Forest
accuracy: 0.82 | precision: 0.74 | recall: 0.43 | f score: 0.54

Gradient Boosting
accuracy: 0.84 | precision: 0.73 | recall: 0.55 | f score: 0.63

AdaBoost
accuracy: 0.81 | precision: 0.63 | recall: 0.58 | f score: 0.60
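
For reading the numbers above: the f score is the harmonic mean of precision and recall, so a model with decent precision but low recall still ends up with a low f score even when accuracy looks fine. A small sketch that recomputes it from the rounded values posted above (the helper name `f_score` is mine, not from the notebook):

```python
def f_score(precision, recall, beta=1.0):
    """F-beta score; beta=1 gives the usual F1 (harmonic mean)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall) pairs copied from the baseline results above.
baseline = {
    "BernoulliNB": (0.60, 0.21),
    "Linear SVM": (0.74, 0.57),
    "Random Forest": (0.74, 0.43),
    "Gradient Boosting": (0.73, 0.55),
    "AdaBoost": (0.63, 0.58),
}

for name, (p, r) in baseline.items():
    print(f"{name}: {f_score(p, r):.2f}")
```

Recomputing from two-decimal inputs can differ from the posted value in the last digit (BernoulliNB comes out at 0.31 here versus 0.32 above), most likely because the posted scores were computed from unrounded precision and recall.
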
rubentea16 commented 3 years ago

Which features does this use?

andreaschandra commented 3 years ago

@rubentea16 tweets only, see https://github.com/jakartaresearch/adi-buzzer/blob/dev/notebook/40_buzzer_classifier.ipynb

rubentea16 commented 3 years ago

Performance Benchmark

Notes :

| Model | Desc | Features | Word Embedding | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|---|---|---|
| RFC | - | multiple-feat | TF-IDF | 0.84 | 0.75 | 0.33 | 0.45 |
| RFC | - | single-feat | TF-IDF | 0.84 | 0.72 | 0.35 | 0.47 |
| SMOTE+RFC | Oversampling train data (Minor class) | multiple-feat | TF-IDF (desc = 3K dim & tweet = 50K dim) | 0.86 | 0.66 | 0.62 | 0.64 |
| SMOTE+RFC | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.86 | 0.68 | 0.57 | 0.62 |
| SMOTE+SVC(default) | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.84 | 0.59 | 0.73 | 0.65 |
| SMOTE+XGBoost(default) | Oversampling train data (Minor class) | single-feat | BPE (tweet = 300 dim) | 0.86 | 0.66 | 0.62 | 0.64 |

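
For anyone unfamiliar with the SMOTE rows: SMOTE creates synthetic minority-class samples by interpolating between a minority sample and one of its nearest minority neighbors, and it is applied to the training split only so the test set stays untouched. The thread does not say which implementation was used; below is a simplified stdlib-only sketch on toy 2-D vectors (the function name, parameters, and data are illustrative, not the project's code):

```python
import math
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point (excluding itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (q - b) for b, q in zip(base, neighbor)])
    return synthetic


# Invented training split: 1 = buzzer (minority), 0 = organic (majority).
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [0.9, 1.0], [1.0, 0.9], [1.1, 1.0], [1.0, 1.1], [0.9, 0.9]]
y_train = [1, 1, 1, 0, 0, 0, 0, 0]

minority = [x for x, y in zip(X_train, y_train) if y == 1]
n_needed = y_train.count(0) - y_train.count(1)

new_points = smote(minority, n_needed, k=2)
X_balanced = X_train + new_points
y_balanced = y_train + [1] * len(new_points)
print(y_balanced.count(0), y_balanced.count(1))  # class counts after oversampling
```

Oversampling before the train/test split would leak interpolated copies of test-like points into training and inflate the scores, which is presumably why the table notes "train data".
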
andreaschandra commented 3 years ago

0.64

interesting

andreaschandra commented 3 years ago

Results after label QA

| Algo | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|
| Bernoulli NB | 0.78 | 0.75 | 0.21 | 0.33 |
| SVM | 0.85 | 0.75 | 0.60 | 0.67 |
| Random Forest | 0.81 | 0.77 | 0.34 | 0.47 |
| Gradient Boosting | 0.84 | 0.78 | 0.53 | 0.63 |
| AdaBoost | 0.82 | 0.67 | 0.56 | 0.61 |

andreaschandra commented 3 years ago

| Algo | Accuracy | Precision | Recall | F-score |
|---|---|---|---|---|
| Bernoulli NB | 0.82 | 0.54 | 0.69 | 0.61 |
| SVM | 0.87 | 0.69 | 0.65 | 0.67 |
| RF | 0.85 | 0.74 | 0.43 | 0.54 |
| Gradient Boosting | 0.87 | 0.72 | 0.54 | 0.62 |
| AdaBoost | 0.84 | 0.60 | 0.56 | 0.58 |