Closed warenlg closed 5 years ago
Since this issue was raised, the bot detection pipeline has been debugged and improved with new features and a more powerful model with tuned parameters. We now reach acceptable performance, although there is still room for improvement. The latest notebook is uploaded to neptune and lives in the cluster.
XGBoost: train classification report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.89 | 1.00 | 0.94 | 124117 |
| True | 0.99 | 0.73 | 0.84 | 57258 |
| accuracy | | | 0.91 | 181375 |
| macro avg | 0.94 | 0.86 | 0.89 | 181375 |
| weighted avg | 0.92 | 0.91 | 0.91 | 181375 |

XGBoost: validation classification report

| | precision | recall | f1-score | support |
|---|---|---|---|---|
| False | 0.89 | 1.00 | 0.94 | 41373 |
| True | 0.99 | 0.73 | 0.84 | 19086 |
| accuracy | | | 0.91 | 60459 |
| macro avg | 0.94 | 0.86 | 0.89 | 60459 |
| weighted avg | 0.92 | 0.91 | 0.91 | 60459 |

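
For anyone reading the reports, here is a minimal sketch of how the summary rows relate to the per-class rows, using the rounded train figures above. The names (`rows`, `f1`, `macro`, `weighted`) are only illustrative and are not taken from the notebook. Because the inputs are already rounded to two decimals, the recomputed values can differ from the printed ones in the last digit (for example, the macro recall comes out as 0.865 here).

```python
# Per-class rows copied from the train report: (precision, recall, f1, support).
rows = {
    "False": (0.89, 1.00, 0.94, 124117),
    "True": (0.99, 0.73, 0.84, 57258),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(r[3] for r in rows.values())

# Recompute each class's f1-score from its precision and recall.
f1_check = {label: round(f1(p, r), 2) for label, (p, r, _, _) in rows.items()}

# Macro average: unweighted mean over classes, per metric.
macro = tuple(sum(r[i] for r in rows.values()) / len(rows) for i in range(3))

# Weighted average: mean over classes weighted by support, per metric.
weighted = tuple(sum(r[i] * r[3] for r in rows.values()) / total for i in range(3))

# Accuracy equals the support-weighted recall.
accuracy = weighted[1]

print(f1_check, macro, weighted, accuracy)
```

The gap between the macro recall and the weighted recall reflects the lower recall on the `True` (bot) class, which is the minority class here.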
@EgorBu proposed a full pipeline to identify bots from user identities, but some bugs seem to have crept into the code, since precision and recall have dropped.
So first, let's recover good accuracy, then try different sampling strategies, and finally try to simplify the pipeline a bit by removing unnecessary steps and features.
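
As one possible starting point for the sampling experiments, here is a minimal sketch of random undersampling of the majority class, plus the class ratio that could be passed to XGBoost's `scale_pos_weight` parameter as an alternative to resampling. This is not the strategy used in the pipeline. The helper name `undersample_majority` is hypothetical, and the class counts are taken from the train report in this thread.

```python
import random


def undersample_majority(labels, seed=0):
    """Return sorted indices that keep every minority sample and an
    equally sized random subset of the majority class."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y]
    neg = [i for i, y in enumerate(labels) if not y]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    return sorted(minority + kept)


# Class counts from the train report: 124117 non-bots, 57258 bots.
n_negative, n_positive = 124117, 57258

# Ratio of negatives to positives (about 2.17), a common choice for
# scale_pos_weight when reweighting instead of resampling.
scale_pos_weight = n_negative / n_positive

# Toy usage: 3 bots and 7 non-bots become a balanced 3 + 3 subset.
toy_labels = [True] * 3 + [False] * 7
toy_indices = undersample_majority(toy_labels)
print(toy_indices, scale_pos_weight)
```

Undersampling discards data, so it would only be worth comparing against reweighting on the validation set rather than assuming either one helps.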