Performance dropped critically

src-d / identity-matching

source{d} extension to match Git signatures to real people.

GNU General Public License v3.0

17 stars 13 forks source link

Performance dropped critically #91

Closed warenlg closed 4 years ago

warenlg commented 5 years ago

I've been benchmarking the performance of match-identities and figured out I was not able to process some orgs I used to process pretty quickly 2 months ago. For example, digitalocean used to take me 3sec, now it is hanging forever. The step that is hanging is reducing identities.

The drop has been introduced between v2.0.0 and v.3.0.0. After going through the commit history, I figured out the drop started at this commit when adding the primary name feature.

The list of git signatures to process includes duplicates (to get the frequence), then commit_date and then commit hash. So, we pass from:

1907 to 45287 signatures for src-d and from 1sec to 60sec (CPU time) to process
57K to 1.77M signatures for digitalocean and from 3sec to forever (CPU time) to process

If we want to keep this feature, we might want to optimize this a bit.

irinakhismatullina commented 5 years ago

I've looked through the commit, on the first sight the only affecting thing may be this.

Here instead of selecting DISTINCT commit signatures, they select all of them to count the stats. If after that no deduplication was added, and all unique commit signatures are now considered as identities, this may lead to such effect...

irinakhismatullina commented 5 years ago

It is easy to check - if after adding this single DISTINCT here performance improves, this is it and we know what to fix:)

@warenlg Is it easy for you to try it in your experiment set up?

UPD: Oh now in the current version it will be useless as we started extracting dates. Is it possible for you to check it at this first breaking revision?

warenlg commented 5 years ago

Yes this brings back the good old performance but indeed, DISTINCT would be useless today since we collect the commit_date and above all commit_hash. We have to look if commit_hash is really useful, store the commit_date more efficiently (maybe for example, all dates in the same row, instead of duplicating name, email each time) and also check if there is not a hidden quadratic complexity somewhere.

irinakhismatullina commented 5 years ago

Awesome!

Do we really collect commit_hash? Haven't seen it at all.

As to commit_date, nothing complex is required, the dates can be discarded right after stats calculation (which is always finished before reducing), so deduplication can be added easily and the performance will be shiny.

warenlg commented 5 years ago

Yes we do, but not sure why yet, https://github.com/src-d/identity-matching/commit/f1c67498fc38944ef31e2e51c1148b2917ff7d82

vmarkovtsev commented 5 years ago

Will a group by work?

vmarkovtsev commented 4 years ago

Yeah we collect hashes because we need them for GitHub API.

vmarkovtsev commented 4 years ago

I fixed this with the GROUP BY as I suggested.

warenlg commented 4 years ago

Just tested the performance again, and indeed it runs way faster now