src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

Debug the bot detection pipeline #65

Closed by warenlg 5 years ago

warenlg commented 5 years ago

@EgorBu proposed a full pipeline to identify bots from user identities, but some bugs seem to have crept into the code: both precision and recall have dropped.

So first, let's recover good accuracy, then try different sampling strategies, and finally try to simplify the pipeline a bit by removing unnecessary steps and features.

warenlg commented 5 years ago

Since this issue was raised, the bot detection pipeline has been debugged and improved with new features and a more powerful model with tuned parameters. We now reach acceptable performance, even though there is still room for improvement. The latest notebook is uploaded to neptune and lives in the cluster.

XGBoost: train classification report
              precision    recall  f1-score   support

       False       0.89      1.00      0.94    124117
        True       0.99      0.73      0.84     57258

    accuracy                           0.91    181375
   macro avg       0.94      0.86      0.89    181375
weighted avg       0.92      0.91      0.91    181375

XGBoost: validation classification report
              precision    recall  f1-score   support

       False       0.89      1.00      0.94     41373
        True       0.99      0.73      0.84     19086

    accuracy                           0.91     60459
   macro avg       0.94      0.86      0.89     60459
weighted avg       0.92      0.91      0.91     60459
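
For anyone reading the reports above, here is a minimal sketch of how the f1-score and the averaged rows follow from the per-class rows. It is not taken from the notebook: the helper names (`f1`, `macro_avg`, `weighted_avg`) are made up for illustration, and the inputs are the rounded validation values copied from the report, so results only agree with the report to two decimals.

```python
# Per-class rows copied from the validation report (values rounded to 2 decimals).
rows = {
    "False": {"precision": 0.89, "recall": 1.00, "support": 41373},
    "True": {"precision": 0.99, "recall": 0.73, "support": 19086},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def macro_avg(rows, key):
    """Unweighted mean of a metric over the classes."""
    return sum(r[key] for r in rows.values()) / len(rows)


def weighted_avg(rows, key):
    """Mean of a metric over the classes, weighted by class support."""
    total = sum(r["support"] for r in rows.values())
    return sum(r[key] * r["support"] for r in rows.values()) / total


f1_false = f1(rows["False"]["precision"], rows["False"]["recall"])
f1_true = f1(rows["True"]["precision"], rows["True"]["recall"])
macro_precision = macro_avg(rows, "precision")
weighted_precision = weighted_avg(rows, "precision")
# Support-weighted recall is the same quantity as overall accuracy.
weighted_recall = weighted_avg(rows, "recall")

for name, value in [
    ("f1 False", f1_false),
    ("f1 True", f1_true),
    ("macro precision", macro_precision),
    ("weighted precision", weighted_precision),
    ("weighted recall", weighted_recall),
]:
    print(f"{name}: {value:.2f}")
```

The lower f1-score of the `True` (bot) class comes from its recall of 0.73: precision is already 0.99, so missed bots, not false alarms, account for most of the remaining error.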