src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

Study how the quality depends on the hard identity size limit #30

vmarkovtsev closed this issue 5 years ago

vmarkovtsev commented 5 years ago

One way to remove bots that were not excluded by the blacklist is to set a hard size threshold. That is, if the number of unique names in an identity is bigger than, say, 100, we do something with it, e.g. drop it completely or split it.

This issue is about plotting how our quality metrics depend on the size threshold when we drop.
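
To make the "drop" variant concrete, here is a minimal sketch. It assumes identities are held as a plain dict mapping an identity id to its list of names; that layout and the `drop_oversized` helper are hypothetical and are not the match-identities implementation.

```python
def drop_oversized(identities, max_names=100):
    """Keep only identities whose number of unique names is within the limit.

    `identities` is assumed to map an identity id to a list of names
    (hypothetical layout, for illustration only).
    """
    return {
        identity_id: names
        for identity_id, names in identities.items()
        if len(set(names)) <= max_names
    }


identities = {
    "alice": ["Alice", "alice", "Alice A."],
    "ci-bot": ["user-%d" % i for i in range(150)],  # suspiciously many names
}
kept = drop_oversized(identities, max_names=100)
print(sorted(kept))
```

Sweeping `max_names` over a range of values and recomputing the quality metrics for each one would give the curve this issue asks for.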

warenlg commented 5 years ago

I achieved this task following these steps:

  1. Collect repository_id, commit_hash, commit_author_name, and commit_author_email using gitbase and a Python client on 22 open source stacks.
  2. Iterate through the 2019-05-01 GHTorrent dump and map every commit hash to the GHTorrent id of its author.
  3. For each org, create a CSV file with repository_id, author_email, author_name, author_id.
  4. For each org, create 10 different identity matching tables in Parquet format by running match-identities with the --cache option pointing to the previous CSV file, from which we dropped the author_id column, with 10 different values for the MaxIdentities parameter: [1, 5, 10, 20, 30, 40, 50, 100, 200, 500].
  5. For each org and each identity table generated (so 22x10), build 2 identity graphs: (1) from the GHTorrent identity mapping, (2) from our own identity matching.
  6. Compute precision and recall using the following definitions of false positives and false negatives, where ght_graph is the graph built from GHTorrent (the ground truth) and pred_graph is the one built from our matching:
    • FP = set(pred_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
    • FN = set(ght_graph.edges) - set.intersection(set(pred_graph.edges), set(ght_graph.edges))
  7. Plot the precision and recall curves depending on MaxIdentities for each org.
  8. Conclude that MaxIdentities=20 is a good trade-off.
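
For reference, steps 5 and 6 can be sketched in plain Python. This is only an illustration under assumptions: each identity is taken to be a group of signatures that forms a clique of undirected edges, and `identity_edges` and `precision_recall` are hypothetical helper names, not part of match-identities. The FP and FN sets follow the definitions above, simplified to plain set differences.

```python
from itertools import combinations


def identity_edges(groups):
    """Turn groups of signatures into a set of undirected edges.

    Assumption: every pair of signatures inside one identity is linked.
    Each edge is stored as a sorted tuple so that (a, b) == (b, a).
    """
    edges = set()
    for members in groups:
        for a, b in combinations(sorted(set(members)), 2):
            edges.add((a, b))
    return edges


def precision_recall(pred_edges, ght_edges):
    """Compute precision and recall of predicted edges against GHTorrent edges."""
    tp = pred_edges & ght_edges
    fp = pred_edges - ght_edges  # predicted links absent from GHTorrent
    fn = ght_edges - pred_edges  # GHTorrent links that we missed
    # Convention chosen here (an assumption): an empty denominator scores 1.0.
    precision = len(tp) / (len(tp) + len(fp)) if pred_edges else 1.0
    recall = len(tp) / (len(tp) + len(fn)) if ght_edges else 1.0
    return precision, recall


# Toy data: our matching merges a, b, c; GHTorrent says a-b and c-d.
pred = identity_edges([["a", "b", "c"], ["d"]])
ght = identity_edges([["a", "b"], ["c", "d"]])
precision, recall = precision_recall(pred, ght)
print(precision, recall)
```

In the toy data, one of the three predicted edges is confirmed by GHTorrent and one of the two GHTorrent edges is recovered.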

[Figure: idmatching_pr_curves (precision and recall curves depending on MaxIdentities, per org)]