src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

Find a way to distinguish regular users from bots #9

Open zurk opened 5 years ago

zurk commented 5 years ago

We can take a rule-based approach as a baseline: the email contains the word "bot" or "no-reply". However, there are emails like tensorflow-gardener@tensorflow.org that are hard to catch this way, so some ML should be applied to find them. Commit time-series features can be used.
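A minimal sketch of that rule-based baseline, assuming the rule is a plain case-insensitive substring check on the email. The function name `is_bot_email` and the `example.com` address are made up for illustration.

```python
def is_bot_email(email: str) -> bool:
    """Rule-based baseline: flag emails containing 'bot' or 'no-reply'."""
    lowered = email.lower()
    return any(marker in lowered for marker in ("bot", "no-reply"))


emails = [
    "cf-infra-bot@example.com",            # hypothetical address
    "tensorflow-gardener@tensorflow.org",  # example from this issue
    "daniel.bohbot@gmail.com",             # a real person, see the comments below
]
flags = {email: is_bot_email(email) for email in emails}
```

With this check, tensorflow-gardener@tensorflow.org is not flagged, while daniel.bohbot@gmail.com is flagged by the substring "bot", so the baseline errs in both directions.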

zurk commented 5 years ago

@warenlg built an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):

Why don't we remove bots, CI, automated accounts, etc. from the identity matching table with a simple regexp? Right now, bots might make up 10% of the Cloud Foundry identities I'm working with for the demo, e.g.,

["log cache ci", "metric store ci", "loggregator ci",
 "pivotal publication toolsmiths", "cf-infra-bot",
 "cloud foundry buildpacks team robot",
 "garden windows", "final release builder",
 "pipeline", "flintstone ci", "capi ci",
 "container networking bot", "cf mega bot",
 "routing-ci", "cf bpm", "uaa identity bot",
 "pcf backup & restore ci",
 "ci bot", "cfcr ci bot", "cfcr"]

I removed 2.5k rows out of 15k in total by excluding identities whose names match [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
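A small sketch of how that filter might be applied, assuming the pattern is run with `re.search` against the lowercased name alone. The thread does not say which string was actually matched, so this is an assumption.

```python
import re

# Pattern quoted in the message above.
BOT_NAME_RE = re.compile(r"[^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing")

# A few names taken from the list above.
names = ["cf-infra-bot", "final release builder", "pipeline",
         "routing-ci", "cfcr ci bot", "log cache ci", "garden windows"]

# Assumption: the pattern is searched in the lowercased name only.
filtered = [n for n in names if BOT_NAME_RE.search(n.lower())]
kept = [n for n in names if n not in filtered]
```

Under this assumption "log cache ci" stays in the table: `[^a-zA-Z]ci[^a-zA-Z]` requires a character after "ci", so a name that ends in " ci" does not match. The original run may have matched against a different string.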

@EgorBu, as discussed, I am assigning this issue to you.

Regardless of the approach you choose, please produce a list of the filtered bots so we can review it manually and check that nothing unrelated gets filtered out.

warenlg commented 5 years ago

Thanks K for filing the issue

vmarkovtsev commented 5 years ago

Related to https://github.com/src-d/eee-identity-matching/issues/30

EgorBu commented 5 years ago

Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"

Problems with regexp:

('cici jiayi shen', 'jiayis.18@intl.zju.edu.cn'), 
('daniel adrian bohbot', 'daniel.bohbot@gmail.com'),
('horaci macias', 'hmacias@avaya.com'),
('melvindebot', '44030121+melvindebot@users.noreply.github.com'),
("daniel obot", "danobot@hotmail.com")

Some French and Chinese names/surnames may look like bots to the regexp.
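The false positives above can be reproduced with a short check. This is a sketch that assumes the current pattern is applied with `re.search` to the name field; the helper `matched_fragment` is hypothetical and only shows which part of the name triggers the match.

```python
import re

# Current pattern, as quoted above.
CURRENT_RE = re.compile(
    r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
)

people = [
    ("cici jiayi shen", "jiayis.18@intl.zju.edu.cn"),
    ("daniel adrian bohbot", "daniel.bohbot@gmail.com"),
    ("horaci macias", "hmacias@avaya.com"),
    ("melvindebot", "44030121+melvindebot@users.noreply.github.com"),
    ("daniel obot", "danobot@hotmail.com"),
]


def matched_fragment(name):
    """Return the substring that triggered the match, or None."""
    match = CURRENT_RE.search(name)
    return match.group(0) if match else None


hits = {name: matched_fragment(name) for name, _email in people}
```

Every name is flagged: "cici jiayi shen" and "horaci macias" through the `ci[\s-]` alternative, the other three through `bot$`. Neither alternative checks that "ci" or "bot" is a standalone word.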

vmarkovtsev commented 5 years ago

@EgorBu Regarding French and Chinese names, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and drop from the "bots" list every entry that has a country assigned.
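A rough sketch of that filter. It assumes the regexp candidates can be joined to GHTorrent users by login and that the "users" table exposes a `country_code` column; the rows and logins below are invented for illustration.

```python
# Hypothetical rows shaped like the GHTorrent "users" table; only the two
# columns used here are shown, and the values are made up.
users = [
    {"login": "dbohbot", "country_code": "fr"},
    {"login": "cf-infra-bot", "country_code": None},
]

# Hypothetical regexp output: login -> name flagged as a bot.
bot_candidates = {
    "dbohbot": "daniel adrian bohbot",
    "cf-infra-bot": "cf-infra-bot",
}

# Logins whose profile has a country assigned are treated as real people.
has_country = {u["login"] for u in users if u["country_code"]}

bots = {
    login: name
    for login, name in bot_candidates.items()
    if login not in has_country
}
```

Candidates with no matching row in "users" stay in `bots` here; whether they should be kept or dropped is a separate decision.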

EgorBu commented 5 years ago

Ideas:

Updates:

Next steps:

EgorBu commented 4 years ago

There are several problems that may affect the quality:

  1. Noisy labels
    • false positives from the regexp, e.g. abbot, julia jenkins
    • false negatives: bots that the regexp does not detect (gardener@tensorflow, for example)
  2. The model input doesn't contain the information required to make a correct prediction
    • false negatives: the email carries no bot-related information but the name does. Ex: egor@bla.ru / Egor's bot for deployment
  3. The name doesn't contain the information required to label it as a bot
    • false positives: the email carries bot-related information but the name does not. Ex: egors-bot-deploy@bla.ru / Egorka. The identity is labeled as not a bot although the email indicates that it is one
  4. Metrics. Deduplication:
    • deduplication is done on several fields; when the repository name is included among them, the resulting quality is reported here: https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
    • it is higher than the current one, which probably means that standard bot names are much more frequent and that, in most cases, standard names are detected with high quality
  5. Metrics. Usage
    • we still don't have a clear understanding of how it should be applied (for each commit, for each identity, etc.); the metrics should be selected based on the usage
  6. Dataset
    • another possible reason the quality was higher here is issues with the dataset
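Items 2 and 3 describe identities where the name and the email disagree. One way to surface such cases for manual review is sketched below. The token-level rule `\bbot\b` is a stand-in chosen for illustration, not the pattern used for labeling in this thread, and the helper `views` is hypothetical.

```python
import re

# Hypothetical token-level rule; not the labeling pattern from the thread.
BOT_TOKEN_RE = re.compile(r"\bbot\b", re.IGNORECASE)

# (email, name) pairs taken from items 2 and 3 above.
identities = [
    ("egor@bla.ru", "Egor's bot for deployment"),
    ("egors-bot-deploy@bla.ru", "Egorka"),
]


def views(email, name):
    """Return (email_says_bot, name_says_bot) for one identity."""
    return bool(BOT_TOKEN_RE.search(email)), bool(BOT_TOKEN_RE.search(name))


# Identities where the two fields disagree are candidates for noisy labels.
disagreements = [
    (email, name)
    for email, name in identities
    if views(email, name)[0] != views(email, name)[1]
]
```

Both example identities end up in `disagreements`, so a list like this could feed the manual review requested earlier in the thread.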

Hypothesis to check

Papers: