redbmk / bin

Subjectively useful tools
5 stars 4 forks source link

Algorithmic error in bisection in git-diff-search (no guarantee to find a local minimum) #1

Open burnpanck opened 8 years ago

burnpanck commented 8 years ago

git-diff-search selects the midpoint commit in the range of interest, and then restricts the further search to the half on the side of the midpoint that belongs to the better endpoint. However, nothing guarantees that the minimum will be on that side of the midpoint. A known bisecting minimisation algorithm is called 'golden-section-seach', it works as follows: Given three points [A,B,C] with known optimality values [x,y,z], where y is the best of the three, pick a new point D somewhere between B and the range end with the lower value; evaluate it's optimality. If it is better than B, B will be one of the new endpoints and D the new midpoint, otherwise vice versa. This way you guarantee to converge at least to a local minimum. Optimal performance is achieved when D is selected according to the golden section, with the smaller side towards B.

redbmk commented 8 years ago

I like it! Thanks, I'll have to take a look into that :)

For a little extra context, I wrote this script because I was working on an android ROM. The source code was posted for the stock phone without any source control context and had changes specific to the phone, so I was trying to figure out where in the original source code it had most likely spawned from.

Here's the stack overflow question I asked that originally lead to writing this script: http://stackoverflow.com/q/13898659/817950

I'm still not convinced that diff -burN -x.git my_git_subtree my_src_subtree | wc -l is the best way to determine the optimality value, hence this other question: http://stackoverflow.com/q/14718696/817950

burnpanck commented 8 years ago

Indeed, I came here through the first stackoverflow question. I had a similar issue with a patched linux kernel supplied with some hardware board, patched for that board but without indicated what was actually patched. I was also questioning your diff -burN -x.git ... and played around with git diff --numstat -- gitrepo srctree, but that was way too slow for a huge project like the linux kernel. In the end I succeeded using diff -burN -x.git, but with a different search method: Since the kernel history is highly branched, git log --oneline cannot provide a useful flattening of the history (commits from different branches get interspersed, so the linearisation will provide lots of false local maxima). Instead, I retrieved the tree structure of the history using git rev-list --parent, and then just followed the three from HEAD jumping from branch/merge to branch/merge. At each point, I queued all directly reachable branch/merge commits, keeping them in a priority queue, such that I try the neighbours of the local minimum first. When a global minimum was found, the queue was cleared. From the best branch/merge commit found, I further applied the same algorithm to direct parental relationship between commits, ignoring merges/branches.

redbmk commented 8 years ago

Would you mind sharing the code you wrote for that? It could be handy to have that later. Maybe the git-diff-search tool could even have different options depending on how you want to search and maybe an option for providing an alternate method to get the optimality values.

burnpanck commented 8 years ago

Actually, I did a completely independent implementation in python. In fact, it was not even a complete script, but rather used in an interactive fashion. I did put together a script now, but it's untested. I'll file a PR against your repo.

redbmk commented 8 years ago

Awesome, thanks!