Open zurk opened 5 years ago
@warenlg did an MVP of this feature (https://src-d.slack.com/archives/C7USX021L/p1563778058004300):
Why don't we remove bots, CI automation, etc. from the identity matching table with a simple regexp? Right now, I might have 10% of bots in the Cloud Foundry identities I'm working with for the demo, e.g.,
["log cache ci", "metric store ci", "loggregator ci",
"pivotal publication toolsmiths", "cf-infra-bot",
"cloud foundry buildpacks team robot",
"garden windows", "final release builder",
"pipeline", "flintstone ci", "capi ci",
"container networking bot", "cf mega bot",
"routing-ci", "cf bpm", "uaa identity bot",
"pcf backup & restore ci",
"ci bot", "cfcr ci bot", "cfcr"]
I removed 2.5k rows out of 15k in total by excluding identities whose names match [^a-zA-Z]ci[^a-zA-Z]|bot$|pipeline|release|routing
@EgorBu, as discussed, I am assigning this issue to you.
Regardless of the approach you choose, please create a list of the filtered bots so we can also review them manually and check that we do not filter anything unrelated.
Thanks K for filing the issue
Current pattern: r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
Problems with regexp:
('cici jiayi shen', 'jiayis.18@intl.zju.edu.cn'),
('daniel adrian bohbot', 'daniel.bohbot@gmail.com'),
('horaci macias', 'hmacias@avaya.com'),
('melvindebot', '44030121+melvindebot@users.noreply.github.com'),
("daniel obot", "danobot@hotmail.com")
Some French and Chinese names/surnames may look like bots to the regexp
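
The problem cases above can be reproduced with a short check. This is only a sketch: it assumes the pattern is applied with `re.search` to lowercased names, which the thread does not state, and the helper name `looks_like_bot` is made up for illustration.

```python
import re

# Pattern quoted in this thread as the current one.
BOT_PATTERN = re.compile(
    r"[^a-zA-Z|]ci\W|[\s-]ci\W|ci[\s-]|[\s-]ci[\s-]|bot$|pipeline|release|routing"
)

# Names reported above as problematic matches.
PROBLEM_NAMES = [
    "cici jiayi shen",
    "daniel adrian bohbot",
    "horaci macias",
    "melvindebot",
    "daniel obot",
]


def looks_like_bot(name: str) -> bool:
    """Return True if the pattern matches anywhere in the name (assumed usage)."""
    return BOT_PATTERN.search(name) is not None


# Collect the reported names that the pattern flags as bots.
false_positives = [name for name in PROBLEM_NAMES if looks_like_bot(name)]
print(false_positives)
```

Every name in the list is flagged: "cici" and "horaci" hit the `ci[\s-]` branch, and the other three end in "bot".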
@EgorBu Regarding French and Chinese, GitHub profiles often contain the country code. You can take the "users" table from GHTorrent and drop from the "bots" list any identity which has a country assigned.
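
A minimal sketch of that country-based filter, under several assumptions: bot candidates can be joined to GHTorrent users by login, the country is available as a `country_code` value per user, and the logins and values below are invented purely for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical extract of the GHTorrent "users" table: login -> country_code.
# None means that no country is assigned. All entries here are made up.
country_by_login: Dict[str, Optional[str]] = {
    "some-human": "fr",
    "another-human": "cn",
    "some-ci-bot": None,
}

# Hypothetical output of the regexp filter (logins flagged as bots).
bot_candidates: List[str] = ["some-human", "some-ci-bot", "unknown-login"]


def drop_located_users(
    candidates: List[str], countries: Dict[str, Optional[str]]
) -> List[str]:
    """Keep only the candidates that have no country assigned."""
    return [login for login in candidates if not countries.get(login)]


remaining_bots = drop_located_users(bot_candidates, country_by_login)
print(remaining_bots)
```

Logins missing from the table are kept as bot candidates here; whether that is the right default is a judgment call.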
Ideas:
(author.date, author.email, author.name, committer.date, committer.email, committer.name)
Updates:
Next steps:
parquet/csv
abbot, julia jenkins and so on
(gardener@tensorflow, for example)
egor@bla.ru / Egor's bot for deployment
egors-bot-deploy@bla.ru / Egorka -> so it will be labeled as not a bot although the email tells that it's a bot
repository name is included - the quality can be found here - https://gist.github.com/EgorBu/a333409dfc12f89ac5fa1dc71461a3c0
false positives - the email contains bot-related info and the name doesn't
false negatives - the email doesn't contain bot-related info and the name does
abot as [a, bot] - and it will make it almost impossible for the model to differentiate one class from another
token splitter to split victor@abot.fr into [victor, abot, fr]
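
A token splitter of this kind could look like the sketch below. It assumes that splitting on every non-alphanumeric character is enough; the function name `split_tokens` is hypothetical and not taken from any existing code in this thread.

```python
import re
from typing import List


def split_tokens(identity: str) -> List[str]:
    """Lowercase the string and split it on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", identity.lower()) if tok]


print(split_tokens("victor@abot.fr"))  # ['victor', 'abot', 'fr']
```

With this splitting, abot stays a single token and is not broken into [a, bot].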
We can take a rule-based approach as a benchmark: the email contains the word bot or no-reply. However, there are emails like tensorflow-gardener@tensorflow.org that are hard to catch this way, so some ML should be applied to find them. Commit time-series features can be used.
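
The rule-based benchmark could be sketched as follows. The assumptions: "bot word" means a whole token equal to bot after splitting on non-alphanumeric characters, and no-reply is matched as a literal substring. The helper names are hypothetical.

```python
import re
from typing import List


def split_tokens(text: str) -> List[str]:
    """Lowercase the string and split it on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def is_bot_email(email: str) -> bool:
    """Rule-based baseline: the email has a 'bot' token or contains 'no-reply'."""
    lowered = email.lower()
    return "no-reply" in lowered or "bot" in split_tokens(lowered)


# Emails mentioned in this thread.
for email in [
    "egors-bot-deploy@bla.ru",
    "egor@bla.ru",
    "victor@abot.fr",
    "tensorflow-gardener@tensorflow.org",
]:
    print(email, is_bot_email(email))
```

Only egors-bot-deploy@bla.ru is flagged; tensorflow-gardener@tensorflow.org is missed, which is the gap that the ML approach is meant to cover.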