gitgitgadget / git

GitGitGadget's Git fork. Open Pull Requests here to submit them to the Git mailing list
https://gitgitgadget.github.io/
Other
204 stars 132 forks source link

Accelerate rename detection and range-diff #519

Open dscho opened 4 years ago

dscho commented 4 years ago

When detecting renames between m and n files, as well as when it detects similarities in git range-diff between m and n commits, Git currently performs m times n comparisons, which is quite expensive for larger m and n.

There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.

A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in libgit.a and use it for the rename detection and for commit matching in git range-diff.

Ideas

Approximate Nearest Neighbor Search

There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.

Locality-sensitive hashing

Locality-sensitive hashing was made famous by its application in web search.

Other methods

dscho commented 4 years ago

I just noticed that @newren is scheduled to speak at Git Merge 2020 about "Scaling the Merge Machinery". The link https://git-merge.com/scaling-the-merge-machinery is not yet populated, apparently, but in a promotional email, I saw this description:

An overlooked dimension of scaling Git is the large number of renames. In this talk, Elijah will share how Palantir Technologies has worked to optimize rename detection and his learnings from the process.

This talk is therefore at least related to what is described in this here ticket.

newren commented 4 years ago

Some general notes:

Some specifics about the talk I gave:

Hope that helps, Elijah

sheikhhamza012 commented 4 years ago

@dscho I would love to work on this or on the conversion of shell to builtin files , can you guide me through the process? :)

newren commented 4 years ago

This might not be a good project to work on right now; it may turn out to not be useful at all after my other optimizations, plus you'd have to be working with diffcore-rename.c while I'm busy making all kinds of other modifications to it. I'm not stopping you if you really want to work on it, just giving you a heads up.

newren commented 4 years ago

(The conversion of shell to builtin files or other things within git are fine projects to work on, of course.)

sheikhhamza012 commented 4 years ago

@newren thanks for replying, actually i am also planning to participate google summer of code and i wanted to contribute to git and i would really appreciate if you could guide me, i have already submitted a patch for and open issue about git bisect not working in sub directories and i want to work more on git. Do you think it would be possible?

newren commented 4 years ago

You're going off topic. ;-) This ticket should remain about its subject, accelerating rename detection and range-diff. Briefly, though, I don't have the capacity to take on a GSoC student this year. I did a quick search for your contributions and saw that they haven't made it to the git mailing list, which is where most potential mentors will be watching, not GitHub. I did a search and found https://github.com/git/git/pull/736 and noticed that the gitgitgadget-git bot provided helpful suggestions on your patches, provided numerous links, and even pointed you at the git-mentoring@googlegroups.com mailing list (which is under-utilized), which I was going to suggest as well. Hopefully that'll help connect you where you need to go. Best of luck!

sheikhhamza012 commented 4 years ago

thank you so much for the advice, I have just submitted the patch to mailing list and will look into the group you mentioned. :)