Open dscho opened 4 years ago
I just noticed that @newren is scheduled to speak at Git Merge 2020 about "Scaling the Merge Machinery". The link https://git-merge.com/scaling-the-merge-machinery is not yet populated, apparently, but in a promotional email, I saw this description:
An overlooked dimension of scaling Git is the large number of renames. In this talk, Elijah will share how Palantir Technologies has worked to optimize rename detection and his learnings from the process.
This talk is therefore at least related to what is described in this here ticket.
Some general notes:
Some specifics about the talk I gave:
Hope that helps, Elijah
@dscho I would love to work on this or on the conversion of shell to builtin files , can you guide me through the process? :)
This might not be a good project to work on right now; it may turn out to not be useful at all after my other optimizations, plus you'd have to be working with diffcore-rename.c while I'm busy making all kinds of other modifications to it. I'm not stopping you if you really want to work on it, just giving you a heads up.
(The conversion of shell to builtin files or other things within git are fine projects to work on, of course.)
@newren thanks for replying, actually i am also planning to participate google summer of code and i wanted to contribute to git and i would really appreciate if you could guide me, i have already submitted a patch for and open issue about git bisect not working in sub directories and i want to work more on git. Do you think it would be possible?
You're going off topic. ;-) This ticket should remain about its subject, accelerating rename detection and range-diff. Briefly, though, I don't have the capacity to take on a GSoC student this year. I did a quick search for your contributions and saw that they haven't made it to the git mailing list, which is where most potential mentors will be watching, not GitHub. I did a search and found https://github.com/git/git/pull/736 and noticed that the gitgitgadget-git bot provided helpful suggestions on your patches, provided numerous links, and even pointed you at the git-mentoring@googlegroups.com mailing list (which is under-utilized), which I was going to suggest as well. Hopefully that'll help connect you where you need to go. Best of luck!
thank you so much for the advice, I have just submitted the patch to mailing list and will look into the group you mentioned. :)
When detecting renames between
m
andn
files, as well as when it detects similarities ingit range-diff
betweenm
andn
commits, Git currently performsm
timesn
comparisons, which is quite expensive for largerm
andn
.There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.
A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in
libgit.a
and use it for the rename detection and for commit matching ingit range-diff
.Ideas
Approximate Nearest Neighbor Search
There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.
Locality-sensitive hashing
Locality-sensitive hashing was made famous by its application in web search.
Other methods