Closed: vmarkovtsev closed this issue 5 years ago.
I achieved this task following these steps:

1. Extracted `repository_id`, `commit_hash`, `commit_author_name`, `commit_author_email` using gitbase and a Python client on 22 open source stacks.
2. Produced a CSV file with the columns `repository_id`, `author_email`, `author_name`, `author_id`.
3. Ran `match-identities` with the `--cache` option pointing to the previous CSV file, from which we dropped the `author_id` column, with 10 different values for the `MaxIdentities` parameter: `[1, 5, 10, 20, 30, 40, 50, 100, 200, 500]`.
4. Computed the false positive and false negative edges:
   `FP = set(pred_graph.edges) - set.intersection(set(pred_graph.edges), set(true_graph.edges))`
   `FN = set(ght_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))`
5. Plotted the results against `MaxIdentities` for each org.

`MaxIdentities=20` looks like a good trade-off.
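The false-positive and false-negative edge sets above can be illustrated with plain Python sets. This is only a minimal sketch, not the code used for the experiment: it assumes each identity is a group of author keys and that every pair inside a group forms an undirected edge. The helper names `identity_edges` and `edge_errors` are hypothetical, and precision and recall are added here as one possible choice of quality metric.

```python
from itertools import combinations


def identity_edges(identities):
    """Turn groups of author keys into a set of undirected edges.

    Assumption: every pair of keys merged into one identity is an edge.
    Each edge is stored as a sorted tuple so (a, b) and (b, a) coincide.
    """
    edges = set()
    for members in identities:
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges


def edge_errors(pred_edges, true_edges):
    """Return (FP, FN, precision, recall) for predicted vs. true edges."""
    pred, true = set(pred_edges), set(true_edges)
    # pred - (pred & true) is the same set as pred - true.
    fp = pred - true
    fn = true - pred
    tp = pred & true
    # Treating an empty edge set as a perfect score is a convention chosen here.
    precision = len(tp) / len(pred) if pred else 1.0
    recall = len(tp) / len(true) if true else 1.0
    return fp, fn, precision, recall


# Toy data: the prediction merges a, b, c; the ground truth merges a+b and c+d.
pred = identity_edges([["a", "b", "c"], ["d"]])
true = identity_edges([["a", "b"], ["c", "d"]])
fp, fn, precision, recall = edge_errors(pred, true)
```

Computing these values once per `MaxIdentities` setting gives the points for a per-org plot.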
One way to eliminate bots that were not excluded by the blacklist is to set a hard size threshold: if the number of unique names is bigger than, say, 100, we do something, e.g. drop the identity completely or split it.
This issue is about plotting how our quality metrics depend on the size threshold when we drop.
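The "drop" variant of the size threshold could be sketched as follows. This is a hypothetical illustration, not the tool's implementation: it assumes identities are held as a mapping from an identity key to its set of names. The names `drop_oversized` and `sweep_thresholds` are invented for the example, and the toy data is made up.

```python
def drop_oversized(identities, max_names):
    """Keep only identities with at most max_names unique names.

    Identities above the threshold are dropped completely, as a crude way
    to remove bots that slipped past the blacklist.
    """
    return {
        key: names
        for key, names in identities.items()
        if len(set(names)) <= max_names
    }


def sweep_thresholds(identities, thresholds):
    """Return, for each threshold, how many identities survive the drop."""
    return {t: len(drop_oversized(identities, t)) for t in thresholds}


# Toy data: two human-sized identities and one bot-like identity with 150 names.
identities = {
    "alice": {"Alice", "alice"},
    "ci-bot": {f"bot-{i}" for i in range(150)},
    "bob": {"Bob"},
}
kept = drop_oversized(identities, max_names=100)
survivors = sweep_thresholds(identities, [1, 5, 100, 200])
```

In a real sweep, the survivor count would be replaced by the quality metrics computed on the surviving identities, giving one point per threshold on the plot.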