src-d / identity-matching

source{d} extension to match Git signatures to real people.
GNU General Public License v3.0

[WIP] fast GitHub match with GetCommit API #63

Closed smola closed 5 years ago

smola commented 5 years ago

What?

Why?

This removes the need for the Search API, which has much lower API quota limits. For an organization like pytorch, the process can fit within the 1-hour limit, as opposed to taking a few hours.

How to improve?

This pull request is a completely dirty approach: it does not work properly with the on-disk caches (they should be removed before running) and it doesn't respect the separation of responsibilities between steps. However, it can be improved to better fit the process as follows:

Note that this requires repository_id from gitbase to be a proper GitHub URL and not an arbitrary path. This is the case for repos imported by source{d} CE, for example, but may not be the case in other scenarios. So MatchByEmail should probably fall back to the Search API when the given repo does not match the GitHub pattern.
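A minimal sketch of that fallback decision, using only the Go standard library. The regular expression, `parseGitHubRepo`, and `lookupStrategy` are hypothetical names made up for illustration and are not taken from this pull request; the accepted URL shapes are an assumption about what repository_id may look like.

```go
package main

import (
	"fmt"
	"regexp"
)

// githubRepoPattern is an ASSUMED pattern for repository_id values that point
// at GitHub, e.g. "github.com/pytorch/pytorch",
// "https://github.com/pytorch/pytorch.git" or "git@github.com:pytorch/pytorch.git".
var githubRepoPattern = regexp.MustCompile(
	`^(?:(?:https?|git|ssh)://)?(?:[^@/]+@)?github\.com[:/]([^/]+)/([^/]+?)(?:\.git)?/?$`)

// parseGitHubRepo (hypothetical helper) extracts the owner and repository name.
// ok is false when the value is not a GitHub URL, for example a local path.
func parseGitHubRepo(repositoryID string) (owner, name string, ok bool) {
	m := githubRepoPattern.FindStringSubmatch(repositoryID)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

// lookupStrategy (hypothetical helper) reports which API a MatchByEmail-like
// step would use: the commit API when owner/name can be derived, else search.
func lookupStrategy(repositoryID string) string {
	if _, _, ok := parseGitHubRepo(repositoryID); ok {
		return "commit"
	}
	return "search"
}

func main() {
	for _, id := range []string{
		"github.com/pytorch/pytorch",
		"git@github.com:src-d/identity-matching.git",
		"/var/lib/repos/pytorch",
	} {
		fmt.Printf("%s -> %s API\n", id, lookupStrategy(id))
	}
}
```

Keeping the decision in a pure function makes it easy to test without network access; the actual API calls would live behind whichever branch is chosen.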

vmarkovtsev commented 5 years ago

So the plan is to scan the distinct commit signatures to pre-populate the cache, so that the Search API will not have to be executed at all. Also, remember https://github.com/src-d/identity-matching/pull/63/files#r322251594. I will start from scratch this week.
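The pre-population idea could look roughly like the sketch below. It assumes one lookup per distinct email is enough; `commitSignature`, `commitAuthorLookup`, and `prepopulateCache` are hypothetical names, and the lookup function is a stand-in for a real call to the GitHub commit API rather than the project's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// commitSignature is a hypothetical record: one commit and its author email.
type commitSignature struct {
	Repo, Hash, Email string
}

// commitAuthorLookup is a hypothetical stand-in for a commit API request that
// returns the GitHub login linked to a commit ("" when none is linked).
type commitAuthorLookup func(repo, hash string) (login string, err error)

// prepopulateCache resolves each distinct email at most once and returns the
// email -> login cache together with the number of lookups performed.
// Misses are stored as "" so the same email is not queried again.
func prepopulateCache(sigs []commitSignature, lookup commitAuthorLookup) (map[string]string, int) {
	cache := make(map[string]string)
	calls := 0
	for _, s := range sigs {
		email := strings.ToLower(strings.TrimSpace(s.Email))
		if email == "" {
			continue
		}
		if _, seen := cache[email]; seen {
			continue // already resolved (or known miss) for this email
		}
		calls++
		login, err := lookup(s.Repo, s.Hash)
		if err != nil {
			login = "" // treat errors as misses in this sketch
		}
		cache[email] = login
	}
	return cache, calls
}

func main() {
	// Fake lookup used only for the demonstration: no network access.
	fake := func(repo, hash string) (string, error) {
		if hash == "h1" {
			return "alice", nil
		}
		return "", nil
	}
	sigs := []commitSignature{
		{Repo: "github.com/a/b", Hash: "h1", Email: "alice@example.com"},
		{Repo: "github.com/a/b", Hash: "h2", Email: "Alice@example.com"},
		{Repo: "github.com/a/b", Hash: "h3", Email: "bob@example.com"},
	}
	cache, calls := prepopulateCache(sigs, fake)
	fmt.Println(cache, calls)
}
```

Emails left with an empty login after this pass are the candidates for a fallback such as the Search API.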

Also spotted: