Find a way to distinguish regular users from bots

zurk commented 5 years ago

We can take a rule-based approach as a baseline: the email contains the word "bot" or "no-reply". However, there are emails like tensorflow-gardener@tensorflow.org that are hard to detect this way, so some ML should be applied to find them. Commit time-series features can be used.

zurk commented 5 years ago

@warenlg did an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):

Why don't we remove bots, CI automated stuff, etc. from the identity matching table with a simple regexp? Right now, I might have 10% of bots in the cloudfoundry identities I'm working with for the demo, e.g.,

["log cache ci", "metric store ci", "loggregator ci",
 "pivotal publication toolsmiths", "cf-infra-bot",
 "cloud foundry buildpacks team robot",
 "garden windows", "final release builder",
 "pipeline", "flintstone ci", "capi ci",
 "container networking bot", "cf mega bot",
 "routing-ci", "cf bpm", "uaa identity bot",
 "pcf backup & restore ci",
 "ci bot", "cfcr ci bot", "cfcr"]

I removed 2.5k rows out of 15k in total by excluding identities whose names match [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
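A minimal sketch of that filtering step, assuming the pattern is applied with a plain search over lower-cased names. The helper name `split_identities` is hypothetical; the filtered names are returned rather than discarded so they can be reviewed afterwards.

```python
import re

# Pattern quoted in the comment above; applying it to lower-cased names is an assumption.
BOT_NAME_RE = re.compile(r"[^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing")


def split_identities(names):
    """Split names into (kept, filtered) so the filtered list can be reviewed by eye."""
    kept, filtered = [], []
    for name in names:
        if BOT_NAME_RE.search(name.lower()):
            filtered.append(name)
        else:
            kept.append(name)
    return kept, filtered


names = ["cf-infra-bot", "pipeline", "final release builder",
         "routing-ci", "garden windows", "log cache ci"]
kept, filtered = split_identities(names)
```

Note that, as written, the `ci` alternative needs a non-letter character on both sides, so a name that ends in "ci" with nothing after it (such as "log cache ci") is not caught by a plain search unless the name is padded or the pattern is changed to use word boundaries.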

@EgorBu, as discussed, I am assigning this issue to you.

Regardless of the approach you choose, please create a list of the filtered bots so we can also review them by eye and check that we do not filter out anything unrelated.

warenlg commented 5 years ago

Thanks K for filing the issue

vmarkovtsev commented 5 years ago

EgorBu commented 5 years ago

Problems with regexp:

('cici jiayi shen', 'jiayis.18@intl.zju.edu.cn'), 
('daniel adrian bohbot', 'daniel.bohbot@gmail.com'),
('horaci macias', 'hmacias@avaya.com'),
('melvindebot', '44030121+melvindebot@users.noreply.github.com'),
("daniel obot", "danobot@hotmail.com")

Some French and Chinese names/surnames may look like bots to the regexp

vmarkovtsev commented 5 years ago

@EgorBu Regarding French and Chinese names, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and drop from the bot list any candidate "bots" that have a country assigned.
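A rough sketch of that country-based cleanup, assuming the GHTorrent "users" rows are available as dicts with `login` and `country_code` fields and that bot candidates are keyed by login. How identities are actually joined to GHTorrent users is not specified in this thread, and the country value below is made up for illustration.

```python
def drop_candidates_with_country(bot_candidates, users):
    """Keep only the candidates that have no country assigned in the users table.

    bot_candidates: logins flagged as bots by the regexp (assumed key).
    users: rows shaped like {"login": ..., "country_code": ...} (assumed shape).
    """
    has_country = {u["login"] for u in users if u.get("country_code")}
    return [c for c in bot_candidates if c not in has_country]


# Toy rows; the country code here is invented for the example.
users = [
    {"login": "melvindebot", "country_code": "fr"},
    {"login": "cf-infra-bot", "country_code": None},
]
remaining = drop_candidates_with_country(["melvindebot", "cf-infra-bot"], users)
```

A candidate that is missing from the users table is kept, since nothing is known about it.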

EgorBu commented 5 years ago

Ideas:

- use the regexp to find highly probable bots (19k found in 1300M rows with the fields author.date, author.email, author.name, committer.date, committer.email, committer.name)
- calculate the author/committer fraction: the distributions for normal users and bots may differ
- contribution activity (time, counts, repositories): the distributions for normal users and bots may differ
- entropy of commit messages: the idea is that bots rely heavily on a few patterns
- intersection of the name and the most-contributed repository
- pretrained (or trained on our dataset) NN model to extract message embeddings + clustering of messages: if a user's messages always fall into 1-3 clusters, it could be a signal of a bot
- pretrained (or trained on our dataset) NN model to extract email/name embeddings + classification/clustering: this could be a good approach because we have quite a lot of bot names
- use statistical features, messages, and emails/names as input to an NN that produces embeddings (triplet loss to pull the embeddings of bots closer to each other) + k-nearest-neighbors search / classification
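To make the commit-message entropy idea concrete, here is a small sketch that computes the Shannon entropy of the word distribution over one author's messages. Whitespace tokenization and the function name `message_entropy` are assumptions; the thread does not say which entropy measure was eventually used.

```python
import math
from collections import Counter


def message_entropy(messages):
    """Shannon entropy (in bits) of the word distribution over an author's commit messages."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Illustrative messages, not taken from the dataset.
templated = ["Bump version", "Bump version", "Bump version"]
varied = ["Fix parser crash", "Add retry logic", "Refactor tests slightly"]

low = message_entropy(templated)
high = message_entropy(varied)
```

An author who repeats the same template gets a low value, while varied messages score higher. The raw value also grows with the amount of text, so comparing authors would likely need some normalization (for example, by the number of tokens).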

Updates:

- launched a pipeline to extract statistics for bots; it is slow (should take ~50 hours)
- downloaded the message dataset; reading about entropy measurements and other possible approaches
- reading and thinking about ideas, coding

Next steps:

- I will rewrite the pipeline to use Spark, since the task matches the map-reduce paradigm
- re-save the datasets as parquet/csv
- launch the pipeline for statistics
- launch the pipeline for entropy
- intersection of the name and the most-contributed repository

EgorBu commented 4 years ago

There are at least several problems that may affect the quality:

Noisy labels
- false positives from the regexp, e.g. abbot, julia jenkins, and so on
- false negatives: undetected bots (gardener@tensorflow, for example)
Model input doesn't contain the info required to make a correct prediction
- false negatives: the email doesn't contain bot-related info but the name does. Ex: egor@bla.ru / Egor's bot for deployment
The name doesn't contain the info required to label it as a bot
- false positives: the email contains bot-related info but the name doesn't. Ex: egors-bot-deploy@bla.ru / Egorka. It is labeled as not a bot, while the email says it is a bot
Metrics. Deduplication
- deduplication is done by several fields; if the repository name is included, the quality is as reported here: https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
- it is higher than the current quality, which probably means that standard bot names are much more frequent and that in most cases standard names are detected with high quality
Metrics. Usage
- we still don't have a clear understanding of how it should be applied (per commit, per identity, etc.); metrics should be selected based on usage
Dataset
- another possible reason the quality was higher here is some dataset issue
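The name/email mismatch cases above can be surfaced with a quick check that runs the same bot pattern over the name and the email separately and reports where they disagree. The pattern and the function name are hypothetical, not the regexp used for labeling in this project.

```python
import re

# Hypothetical pattern: "bot" not surrounded by other letters.
BOT_RE = re.compile(r"(^|[^a-z])bot([^a-z]|$)")


def signal_mismatch(name, email):
    """Return which field carries a bot signal: 'both', 'name_only', 'email_only' or 'none'."""
    name_hit = bool(BOT_RE.search(name.lower()))
    email_hit = bool(BOT_RE.search(email.lower()))
    if name_hit and email_hit:
        return "both"
    if name_hit:
        return "name_only"
    if email_hit:
        return "email_only"
    return "none"


# The two examples from the list above.
case_a = signal_mismatch("Egor's bot for deployment", "egor@bla.ru")
case_b = signal_mismatch("Egorka", "egors-bot-deploy@bla.ru")
```

Rows that come back as `name_only` or `email_only` are the ones where a label derived from one field and a prediction based on the other are likely to disagree, so they are good candidates for manual review.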

Hypotheses to check

metrics - clarify how to measure quality
Dataset
- select a row in dataset
- split dataset into 2 parts before some row and after
- assign labels (0 - before, 1 - after)
- train classifier - if the quality is better than random - something is fishy with dataset
false positives (the email contains bot-related info but the name doesn't) and false negatives (the email doesn't contain bot-related info but the name does)
- compute labels & predictions
- extract features separately from names & emails
- find nearest neighbors by name
- find nearest neighbors by email
- several situations are possible:
  - labels & predictions are the same among the nearest neighbors for names & emails - perfect
  - labels among the nearest neighbors for names are not the same - possible regexp mistakes?
  - predictions among the nearest neighbors for emails are not the same - check it
  - labels & predictions are not the same among the nearest neighbors for names & emails - possible regexp mistake?
the model overfits to the regexp's mistakes
- hypothesis - the number of mistakes is not that big
- train several models on different chunks of data - this reduces the number of mistakes each model sees
- vote among the models when making a prediction
- focus on samples where predictions and labels differ
- focus on samples where the models' predictions differ
features are not good enough
- BPE could split abot into [a, bot], which makes it almost impossible for the model to tell one class from the other
  - use a token splitter to split victor@abot.fr into [victor, abot, fr]
  - add a feature that flags tokens that are in the exception list
  - don't extract BPE features from exceptions
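The token-splitting idea from the last item can be sketched as follows. The exception list contents and the function names are hypothetical; the sketch only shows splitting on non-alphanumeric characters, flagging exception tokens, and selecting which tokens would still go through BPE.

```python
import re

# Hypothetical exception list: tokens that contain "bot" but are known not to be bots.
EXCEPTIONS = {"abot", "abbot"}


def split_tokens(identity):
    """Split a name or email into lower-cased alphanumeric tokens."""
    return [t for t in re.split(r"[^a-z0-9]+", identity.lower()) if t]


def token_features(identity):
    """Pair each token with a flag saying whether it is in the exception list."""
    return [(tok, tok in EXCEPTIONS) for tok in split_tokens(identity)]


def tokens_for_bpe(identity):
    """Tokens that would still be passed to BPE, i.e. everything except exceptions."""
    return [tok for tok, is_exception in token_features(identity) if not is_exception]


tokens = split_tokens("victor@abot.fr")
features = token_features("victor@abot.fr")
bpe_input = tokens_for_bpe("victor@abot.fr")
```

Keeping the exception flag as a separate feature lets the model see that "abot" was present without BPE ever breaking it into a "bot" subword.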

src-d / identity-matching

Find a way to distinguish regular users from bots #9
