Improve performance for inter-file code motion analysis.

Following on from the comment at the end of ticket #26 ....

Performance is disappointing when inter-file code motion analysis involves large files.

Do something about it.

The question of performance has arisen while working on ticket #26.

The example shown on the README videos is used as a benchmark.

Git commit SHA: c2d3e129941763e32f553c7e95ba33ca2cf9a8c4 - this was prior to commencing work on that ticket.

Three runs each take ~2 seconds, nice and snappy.

The matching process yields:

09:31:46.464 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Minimum match window size across all files over all sides: 4
09:31:46.466 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Minimum sure-fire match window size across all files over all sides: 4
09:31:46.466 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Maximum match window size across all files over all sides: 1124
09:31:46.763 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 283, number of matches is: 172, looking for a more optimal match with estimated window size of: 445.
09:31:46.819 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 364, number of matches is: 81, looking for a more optimal match with estimated window size of: 444.
09:31:46.831 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 444, number of matches is: 1, restarting search to look for smaller matches.
09:31:46.925 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 223, number of matches is: 226, looking for a more optimal match with estimated window size of: 892.
09:31:46.956 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 333, number of matches is: 27, looking for a more optimal match with estimated window size of: 359.
09:31:46.982 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 359, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.045 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 181, number of matches is: 198, looking for a more optimal match with estimated window size of: 533.
09:31:47.072 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 270, number of matches is: 22, looking for a more optimal match with estimated window size of: 291.
09:31:47.087 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 291, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.131 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 147, number of matches is: 96, looking for a more optimal match with estimated window size of: 242.
09:31:47.154 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 242, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.202 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 122, number of matches is: 78, looking for a more optimal match with estimated window size of: 174.
09:31:47.228 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 174, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.274 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 88, number of matches is: 59, looking for a more optimal match with estimated window size of: 146.
09:31:47.313 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 146, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.358 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 74, number of matches is: 20, looking for a more optimal match with estimated window size of: 84.
09:31:47.400 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 84, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.462 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 43, number of matches is: 113, looking for a more optimal match with estimated window size of: 82.
09:31:47.501 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 82, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.548 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 42, number of matches is: 82, looking for a more optimal match with estimated window size of: 67.
09:31:47.583 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 67, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.624 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 35, number of matches is: 112, looking for a more optimal match with estimated window size of: 48.
09:31:47.655 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 48, number of matches is: 8, restarting search to look for smaller matches.
09:31:47.711 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 25, number of matches is: 7, looking for a more optimal match with estimated window size of: 31.
09:31:47.746 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 31, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.793 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 17, number of matches is: 14, looking for a more optimal match with estimated window size of: 24.
09:31:47.851 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 24, number of matches is: 1, restarting search to look for smaller matches.
09:31:47.904 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 13, number of matches is: 10, looking for a more optimal match with estimated window size of: 22.
09:31:47.944 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 22, number of matches is: 1, restarting search to look for smaller matches.
09:31:48.074 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 5, number of matches is: 10, restarting search to look for smaller matches.
09:31:48.215 [io-compute-4] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search for matches whose size is no less than the sure-fire match window size of: 4 has terminated; results are:

So that was a great result.

Git commit SHA: 36f9f82b8bc67599462bf10481fb588e4c23a218 is the latest and greatest for ticket #26.

Three runs take 2 minutes 24 seconds, 2 minutes 30 seconds and 2 minutes 23 seconds.

Oh, drat.

The matching process yields:

10:00:45.803 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Minimum match window size across all files over all sides: 4
10:00:45.806 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Minimum sure-fire match window size across all files over all sides: 4
10:00:45.806 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Maximum match window size across all files over all sides: 15225
10:00:54.641 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 7614, number of matches is: 7612, looking for a more optimal match with estimated window size of: 15225.
10:00:54.695 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 15225, number of matches is: 1, restarting search to look for smaller matches.
10:01:01.336 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 7614, number of matches is: 8, looking for a more optimal match with estimated window size of: 7621.
10:01:08.271 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 7621, number of matches is: 1, restarting search to look for smaller matches.
10:01:13.247 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 3812, number of matches is: 4269, looking for a more optimal match with estimated window size of: 7603.
10:01:20.065 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 7603, number of matches is: 1, restarting search to look for smaller matches.
10:01:24.685 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 3803, number of matches is: 486, looking for a more optimal match with estimated window size of: 4288.
10:01:29.495 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 4288, number of matches is: 1, restarting search to look for smaller matches.
10:01:32.977 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 2145, number of matches is: 819, looking for a more optimal match with estimated window size of: 2612.
10:01:36.824 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 2612, number of matches is: 1, restarting search to look for smaller matches.
10:01:39.347 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 1307, number of matches is: 1796, looking for a more optimal match with estimated window size of: 2495.
10:01:43.042 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 2495, number of matches is: 1, restarting search to look for smaller matches.
10:01:45.424 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 1249, number of matches is: 723, looking for a more optimal match with estimated window size of: 1644.
10:01:48.293 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 1644, number of matches is: 1, restarting search to look for smaller matches.
10:01:50.100 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 823, number of matches is: 898, looking for a more optimal match with estimated window size of: 1575.
10:01:52.913 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 1575, number of matches is: 1, restarting search to look for smaller matches.
10:01:54.652 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 789, number of matches is: 179, looking for a more optimal match with estimated window size of: 967.
10:01:56.605 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 967, number of matches is: 1, restarting search to look for smaller matches.
10:01:58.684 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 244, number of matches is: 253, looking for a more optimal match with estimated window size of: 978.
10:01:59.709 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 364, number of matches is: 81, looking for a more optimal match with estimated window size of: 444.
10:02:00.884 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 444, number of matches is: 1, restarting search to look for smaller matches.
10:02:01.678 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 223, number of matches is: 226, looking for a more optimal match with estimated window size of: 892.
10:02:02.676 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 333, number of matches is: 27, looking for a more optimal match with estimated window size of: 359.
10:02:03.712 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 359, number of matches is: 1, restarting search to look for smaller matches.
10:02:04.480 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 181, number of matches is: 222, looking for a more optimal match with estimated window size of: 533.
10:02:05.417 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 270, number of matches is: 22, looking for a more optimal match with estimated window size of: 291.
10:02:06.326 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 291, number of matches is: 1, restarting search to look for smaller matches.
10:02:07.030 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 147, number of matches is: 154, looking for a more optimal match with estimated window size of: 242.
10:02:07.855 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 242, number of matches is: 1, restarting search to look for smaller matches.
10:02:08.491 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 122, number of matches is: 161, looking for a more optimal match with estimated window size of: 204.
10:02:09.257 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 204, number of matches is: 1, restarting search to look for smaller matches.
10:02:09.853 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 103, number of matches is: 116, looking for a more optimal match with estimated window size of: 174.
10:02:10.550 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 174, number of matches is: 1, restarting search to look for smaller matches.
10:02:11.130 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 88, number of matches is: 59, looking for a more optimal match with estimated window size of: 146.
10:02:11.800 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 146, number of matches is: 1, restarting search to look for smaller matches.
10:02:12.353 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 74, number of matches is: 20, looking for a more optimal match with estimated window size of: 84.
10:02:12.907 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 84, number of matches is: 1, restarting search to look for smaller matches.
10:02:13.479 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 43, number of matches is: 113, looking for a more optimal match with estimated window size of: 82.
10:02:14.041 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 82, number of matches is: 1, restarting search to look for smaller matches.
10:02:14.577 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 42, number of matches is: 82, looking for a more optimal match with estimated window size of: 67.
10:02:15.132 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 67, number of matches is: 1, restarting search to look for smaller matches.
10:02:15.632 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 35, number of matches is: 112, looking for a more optimal match with estimated window size of: 48.
10:02:16.136 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 48, number of matches is: 8, restarting search to look for smaller matches.
10:02:16.693 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 25, number of matches is: 7, looking for a more optimal match with estimated window size of: 31.
10:02:17.192 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 31, number of matches is: 1, restarting search to look for smaller matches.
10:02:17.791 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 17, number of matches is: 14, looking for a more optimal match with estimated window size of: 24.
10:02:18.308 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 24, number of matches is: 1, restarting search to look for smaller matches.
10:02:19.105 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 13, number of matches is: 10, looking for a more optimal match with estimated window size of: 22.
10:02:19.628 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 22, number of matches is: 1, restarting search to look for smaller matches.
10:02:25.974 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 7, number of matches is: 5, looking for a more optimal match with estimated window size of: 11.
10:02:27.050 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 11, number of matches is: 1, restarting search to look for smaller matches.
10:02:45.966 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 5, number of matches is: 10, restarting search to look for smaller matches.
10:03:07.945 [io-compute-15] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search for matches whose size is no less than the sure-fire match window size of: 4 has terminated; results are:

Hmm...

The problem is that the older commit had code that applied code motion analysis to individual files, and only those files that required a three-way merge from Git's point of view.

The code from the newer commit applies code motion analysis right across all files that are changed on at least one branch wrt the base.

Now in the benchmark example, the file that is merged by itself using the first commit's code is fairly small; other files brought in via the merge include one heavyweight with 2721 lines of code and another two middleweights with 970 and 514 lines of code. The lightweight that was actually analysed for matches was a mere 178 lines long.

This is apparent in the match sizes:

c2d3e129941763e32f553c7e95ba33ca2cf9a8c4

Minimum match window size across all files over all sides: 4
Minimum sure-fire match window size across all files over all sides: 4
Maximum match window size across all files over all sides: 1124
Search has found an improved match at window size: 283, number of matches is: 172, looking for a more optimal match with estimated window size of: 445.
Search has found an improved match at window size: 364, number of matches is: 81, looking for a more optimal match with estimated window size of: 444.
Search has found an optimal match at window size: 444, number of matches is: 1, restarting search to look for smaller matches.
Search has found an improved match at window size: 223, number of matches is: 226, looking for a more optimal match with estimated window size of: 892.

https://github.com/sageserpent-open/kineticMerge/commit/36f9f82b8bc67599462bf10481fb588e4c23a218

Minimum match window size across all files over all sides: 4
Minimum sure-fire match window size across all files over all sides: 4
Maximum match window size across all files over all sides: 15225
Search has found an improved match at window size: 7614, number of matches is: 7612, looking for a more optimal match with estimated window size of: 15225.
Search has found an optimal match at window size: 15225, number of matches is: 1, restarting search to look for smaller matches.
Search has found an improved match at window size: 7614, number of matches is: 8, looking for a more optimal match with estimated window size of: 7621.
Search has found an optimal match at window size: 7621, number of matches is: 1, restarting search to look for smaller matches.
Search has found an improved match at window size: 3812, number of matches is: 4269, looking for a more optimal match with estimated window size of: 7603.
Search has found an optimal match at window size: 7603, number of matches is: 1, restarting search to look for smaller matches.

There is a lot more content to analyse when doing inter-file code motion analysis.

Something to ponder - to we need to look for matches right up to the size of the largest file? Where would such a large match go, if not in the original file (assuming there is just one outlier in file sizes)? If a file is modified on only one side, can we make some shortcuts?

As of Git commit SHA: 97a34be83d1a20c4d09af3cfc432b5fe45d155ec, tried a simple speculative optimisation in the iterated binary search for the optimal matches - if it looks likely that a file has a full match on two sides based on its size in tokens across the sides, then this can be used as a hint.

This gets a slight improvement down to around 2 minutes and 5 seconds or so.

An interesting development: profiling the application while it merged the benchmark example revealed a large amount of time spent doing sequence comparisons of tokens. This was put in a while back when the Rabin Fingerprint code was swapped out in favour of cheaper rolling polynomial hashing - the latter was somewhat prone to accidental collisions, so PotentialMatchKey was introduced to generalise the notion of a fingerprint, adding a sequential comparison of elements as a tiebreaker to the first step of comparing fingerprints.

That's all well and good, but the iterated binary search for the optimal matches discovers a lot of suboptimal matches on the way - and each time a match is found, that tiebreaker comparison will be executed. That's what showed up under profiling.

So, hacking CodeMotionAnalysis temporarily so that the tiebreaker code is not executed yields run times of 1 minutes 4 seconds, 1 minute 4 seconds and 1 minute 4 seconds. Much better.

Furthermore, CodeMotionAnalysisTest is still passing (and executing more test cases, good). Its failure due to hash collisions was what prompted the cutover, so maybe it is worth ditching the tiebreaker now, or at least curtailing the amount of tiebreaking done...

PotentialMatchKey was introduced in Git commit SHA: 802b783cd6985cfb98ca734357899039c89ed003.

This was done after the issue of hash collisions in RollingHash was identified in f1ca7b941a048de0c6211976ea9573db6615335b and progessively fixed through c084fddf5442e69a66e0f41711c5e0169d66cad5 and b8006c86a990a1feb1575b455dbb85ec3ac419c6.

The fix was deliberately broken as a temporary measure in 3b26a09f944a2d87ef05daa5eba634f5de8133b2 prior to working on 802b783cd6985cfb98ca734357899039c89ed003 and the breakage was subsequently reverted in be1d971366897106f2eba5e6f9765e2dd8bb5e01.

So the use of PotentialMatchKey was intended as belt-and-braces after the fix was done in RollingHash, just in case further hash collisions might be lurking around the corner. The feeling was that the performance overhead would be small enough to justify the cautious approach.

The ordering for PotentialMatchKey has been modified so that the tiebeaker content comparison is still performed, but only up to a limit (currenty 10 elements). This yields comparable performance with outright removal of the tiebreak, coming in at 1 minute 6 seconds, 1 minute 4 seconds and 1 minute 6 seconds.

The breakage in RollingHash applied in 3b26a09f944a2d87ef05daa5eba634f5de8133b2 was temporarily reapplied - CodeMotionAnalysisTest did not break.

As a further check, the tiebreak limit was temporarily hacked to zero in addition to the breakage in RollingHash - this did break CodeMotionAnalysisTest.

So the tiebreak is actually required for belt-and-braces, and it is effective with the limit of 10 elements.

Both prior to the tiebreak change and subsequently, there is a noticeable show down when matching gets to small window sizes:

13:50:30.911 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 146, number of matches is: 1, restarting search to look for smaller matches.
13:50:31.363 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 74, number of matches is: 20, looking for a more optimal match with estimated window size of: 84.
13:50:31.779 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 84, number of matches is: 1, restarting search to look for smaller matches.
13:50:32.300 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 43, number of matches is: 113, looking for a more optimal match with estimated window size of: 82.
13:50:32.733 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 82, number of matches is: 1, restarting search to look for smaller matches.
13:50:33.206 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 42, number of matches is: 82, looking for a more optimal match with estimated window size of: 67.
13:50:33.644 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 67, number of matches is: 1, restarting search to look for smaller matches.
13:50:34.125 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 35, number of matches is: 112, looking for a more optimal match with estimated window size of: 48.
13:50:34.567 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 48, number of matches is: 8, restarting search to look for smaller matches.
13:50:35.098 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 25, number of matches is: 7, looking for a more optimal match with estimated window size of: 31.
13:50:35.558 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 31, number of matches is: 1, restarting search to look for smaller matches.
13:50:36.143 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 17, number of matches is: 14, looking for a more optimal match with estimated window size of: 24.
13:50:36.625 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 24, number of matches is: 1, restarting search to look for smaller matches.
13:50:37.400 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 13, number of matches is: 10, looking for a more optimal match with estimated window size of: 22.
13:50:37.895 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 22, number of matches is: 1, restarting search to look for smaller matches.
13:50:44.056 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 7, number of matches is: 5, looking for a more optimal match with estimated window size of: 11.
13:50:45.136 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 11, number of matches is: 1, restarting search to look for smaller matches.
13:51:02.570 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 5, number of matches is: 10, restarting search to look for smaller matches.
13:51:23.168 [io-compute-2] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search for matches whose size is no less than the sure-fire match window size of: 4 has terminated; results are:

Above a match size of 22, each stage takes place fairly fast, but below that size the matching stutters with intervals of several seconds.

What's going on here?

Playing around with --minimum-ambiguous-match-size doesn't help with the end-game slow down.

Switching to a more complex benchmarking example that exhibits a more severe end-game slow down...

Profiling via VisualVM shows that the code that tests for subsumption or overlap of sections that are being considered as part of a match is really hammered.

To see why, consider the number of potential matches generated by the fingerprinting at each window size (added some temporary debug output):

Potential match count at window size: 35 is 50127.
12:31:08.852 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 35, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 19 is 98020.
12:31:09.708 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 19, number of matches is: 18, looking for a more optimal match with estimated window size of: 25.
Potential match count at window size: 25 is 68605.
12:31:10.533 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 25, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 14 is 189772.
12:31:11.555 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 14, number of matches is: 21, looking for a more optimal match with estimated window size of: 24.
Potential match count at window size: 24 is 72368.
12:31:12.394 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 24, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 13 is 224810.
12:31:13.484 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 13, number of matches is: 11, looking for a more optimal match with estimated window size of: 23.
Potential match count at window size: 23 is 76724.
12:31:14.323 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 23, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 13 is 224810.
Potential match count at window size: 8 is 1695541.
12:31:19.157 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 8, number of matches is: 7, looking for a more optimal match with estimated window size of: 11.
Potential match count at window size: 11 is 347052.
12:31:20.469 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 11, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 7 is 2817115.
12:31:26.285 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 7, number of matches is: 4, looking for a more optimal match with estimated window size of: 10.
Potential match count at window size: 10 is 516341.
12:31:27.874 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 10, number of matches is: 1, restarting search to look for smaller matches.
Potential match count at window size: 6 is 4770549.
Potential match count at window size: 4 is 13200907.
12:32:02.904 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an improved match at window size: 4, number of matches is: 20, looking for a more optimal match with estimated window size of: 5.
Potential match count at window size: 5 is 7059180.
12:32:16.589 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search has found an optimal match at window size: 5, number of matches is: 10, restarting search to look for smaller matches.
Potential match count at window size: 4 is 13200907.
12:32:42.180 [io-compute-13] DEBUG c.s.k.core.CodeMotionAnalysis$ -- Search for matches whose size is no less than the sure-fire match window size of: 4 has terminated; results are:

The problem here is that we have an efficient fingerprinting algorithm that is optimised to blast through entire files - so it always produces every possible match that could involve a section in a file, even though as the window sizes get smaller, many such matches will naturally be subsumed by larger matches.

As can be seen, the number of those potential matches that turn out not to be part of a larger match becomes a sliver as the size goes down. This is exacerbated when matching has to consider files that are very large, which is what is going on here.

We have an efficient fingerprinting algorithm that is acting as a firehose for the subsumption and overlap tests to drink from!

What could be done is to break up the fingerprinting in each file so it works in-between the existing larger matches, but that might be very tricky to implement, especially when the subsumption rules depend on the kind of potential match being generated - we wouldn't know that when deciding how to break up the fingerprinting.

It would also require re-priming the fingerprinting mechanism for each stretch in a file, which would defeat the efficiency afforded by a rolling hash - although it could be argued that we still have to scan all the elements anyway when hashing, and we only want the matches that are not subsumed or overlapping with larger matches. It still feels awkward doing this, though.

Another tack is to make the testing much more efficient. Last time I looked, the fingertree implementation was the best available, but perhaps another look around is needed, maybe even rolling a custom implementaton. This doesn't appeal, but if needs must...

However, there is a quick-and-dirty solution - in the example above, the file sizes of 2581, 2612, 4288, 5076 and 15225 all turned out to be matches in their own right. Given that these larger files cause the problem, why not recognise that if an entire file has been matched, then there is no point in looking for subsumption or overlap of a section in that file at a smaller window size - there is no room for that section to escape in that file.

So once a section covering an entire file has been matched, we may as well drop the file entirely from consideration for lower match sizes.

Hmm, not so fast - it can be valid to look for smaller matches that have a section subsumed within a larger match, see here: https://github.com/sageserpent-open/kineticMerge/blob/066b868ef0dad8b184651766791407001e25f4ab/src/main/scala/com/sageserpent/kineticmerge/core/CodeMotionAnalysis.scala#L1020.

Naively blocking all matches breaks CodeMotionAnalysis and the manual benchmarking test.

On reflection, we have to allow some subsumption, otherwise we don't cover the situation where a file is edited on, say, our side but is left unchanged on their side (or vice-versa). That results in the base and their side matching the entire file as a pairwise match, but we need those smaller all-sides matches that surround our edits. Doh!

However, while we can't drop the file entirely, perhaps we can use the knowledge of an entire file match to do a fast-path implementation of the tests for overlaps and subsumption - so those rules can be sped up a fair bit in some cases. We still have to drink from the firehose when looking for small matches in large files, but we might be able to keep up...

Another gotcha - overlap testing (which is distinct from subsumption) can't use an entire file match as a fast path. Think about it...

The existing code does overlap testing first and then subsumption, so this needs a bit of thought...

Could try swapping around the overlap and subsumption tests and hope that the fast path on the subsumption knocks out enough potential matches to avoid hitting too many overlap tests...

Given the larger file sizes of 2581, 2612, 4288, 5076 and 15225, how did matching at size 8 yield 1695541 potential matches? When we get to a match size of 4, it is plausible that there might be lots of ambiguous matches, but isn’t 8 enough context to force unique matches?

If so, where is the ~10x increase in matches versus tokens coming from?

As of Git commit SHA: 32454ce4da9d24c9229446fa91b12f7c56d329bd, got a modest performance benefit from the fast-path work. The latest manual benchmark has gone from 2 minutes 20 seconds, 2 minutes 33 seconds and 2 minutes 26 seconds to 1 minute 47 seconds, 1 minute 46 seconds and 1 minute 46 seconds.

Overlap testing is now done after subsumption, which gives the subsumption fast-path a chance to kick in (there is no equivalent fast-path for overlaps).

Two things:

It is clear that optimising the subsumption tests can have an impact on performance - in fact, given that the overlap tests are still being hit even with the fast-path tests in front of them, then any kind of optimisation shared by both subsumption and overlap testing should yield a decent performance boost.
As mentioned above, why are there so many potential matches? It's worth revisiting this. Maybe the ambiguous match threshold should be bumped right up...

OK, let's try --minimum-ambiguous-match-size=40 ...

The manual benchmark takes 1 minute 50 seconds, 1 minute 52 seconds. This line of investigation isn't leading anywhere.

No, spoke too soon.

The fundamental performance hit is from having lots of potential matches - and they are mostly due to ambiguous matches.

Outfitting the matching code with some debug code reveals the awful truth:

Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/java/com/sageserpent/americium/java/junit5/TrialsApiTests.java,
  startOffset = 3750,
  size = 4
, label = "BASE: e6f3b850", content = "final TrialsFactoring.TrialException ")", count: 64.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 8558,
  size = 4
, label = "BASE: e6f3b850", content = "value % 3 =")", count: 49.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/main/scala/com/sageserpent/americium/TrialsImplementation.scala,
  startOffset = 482,
  size = 4
, label = "BASE: e6f3b850", content = "](\\n            generationOperation: ")", count: 4.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 6226,
  size = 4
, label = "BASE: e6f3b850", content = ",\\n        api.alternate")", count: 900.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 1482,
  size = 4
, label = "BASE: e6f3b850", content = ".supplyTo(println")", count: 2209.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 8350,
  size = 4
, label = "BASE: e6f3b850", content = "trials...\\n\\n      ")", count: 4.
Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 10418,
  size = 4
, label = "BASE: e6f3b850", content = "),\\n          (integerVector")", count: 36.

These are base sections that form part of all-sides matches prior to being rejected by the various overlap and subsumption tests. Take this entry, for instance...

Section in: "Section(
  path = /Users/gerardmurphy/IdeaProjects/americium/src/test/scala/com/sageserpent/americium/TrialsSpec.scala,
  startOffset = 1482,
  size = 4
, label = "BASE: e6f3b850", content = ".supplyTo(println")", count: 2209.

... looking at the source file in Americium reveals 47 occurences of this in the file TrialsSpec. That means that we were doing well at only dealing with 2209 potential matches, there are actually 47 ^ 3 of them to worry about, a mere 10823.

Fortunately, in the code used to generate these results, there is now a pre-filtration using the overlap tests, so we only have to wade through 2209 of them (more on this later).

These aren't isolated occurences - the number of ambiguous matches at low window sizes in the manual benchmark is staggering. A lot of these are code snippets that are cut and paste in similar tests in TrialsSpec, and there a bits of import statements, which tend to recur across the various files in the Americium codebase. Other hot favourites are the type SupplyToSyntax[Case] and the slice of the multiple invocation ).withLimit( that tend to crop up in multiple files.

The reason for --minimum-ambiguous-match-size=40 not making a difference is because the code used to run the benchmark referenced in the previous comment has to wade through all the potential matches prior to deciding whether to throw away ambiguous matches - so the logic is to validate all potential matches, then apply the ambiguity threshold after winnowing out the potential matches that are overlapped or subsumed. That results in the very tiny sliver of matches (some of which are still ambiguous, but we're talking about 4 mutually ambiguous matches here).

Some more ideas were tried out in addition to the fast-path work mentioned above...

In Git commit SHA: f55f27249a19f0737bb28da552a4dc7fada1e8fc, the code making the potential matches via Cartesian products of sections was cleaned up - although the intent was originally to evaluate the Cartesian product lazily, this had performance bugs.

This improved performance to 2 minutes 2 seconds, 2 minutes 0 seconds - it was based off the same parent commit ae5d09ed55b71ab298833ad458315d241bf72859 that the fast-path work was started on, so it doesn't have the fast-path changes. So a modest improvement in a parallel tack.

This was improved in Git commit SHA: 31cf07aa693e9054c0e3bf146ffa75e8ed0b8b78 by hoisting the overlap tests in the opposite direction from the fast-path work - moving them right up prior to the Cartesian product to thin out the number of combinations.

That improved performance to 1 minute 42 second, 1 minute 38 seconds and 1 minute 28 seconds, beating the fast path line of work.

Merging the fast-path line of work in with the Cartesian product line in 177351d67d62a15b6cfe41b2443806ad1c6f527c yielded 1 minute 27 seconds, 1 minute 24 seconds. A slight synergy between the two approaches, perhaps?

It is this last variation that was used to generate the debug output above - given the thinning from 10823 to 2209, hoisting the overlap tests had an effect!

Another idea tried in Git commit SHA: 46bb3fc50772b6a3b6eedd5b69ce040633524f30 was to play fast and loose and split off the Cartesian product from the subsequent validation and generation of matches, so the check for the ambiguity threshold could be interposed between them as way of drastically thinning out the Cartesian product.

The results were very encouraging: 46 seconds, 42 seconds and 40 seconds.

Suprisingly the other tests didn't break - but I'm concerned about the correctness of giving up at the first hurdle as soon as some ambiguous match is detected when that match would have been rejected anyway. It means that valid ambiguous matches are swept under the carpet.

Now, given that the whole reason why --minimum-ambiguous-match-size was precisely to sweep such valid ambiguous matches under the carpet anyway, and given that this drastic thinning only takes place when said command line flag is used, why worry?

Need to merge all three approaches together and check what happens when the flag is / isn't supplied...

Stop press - merging the fast-path tests in to the Cartesian product plus hoisted overlap tests gave a performance regression when --minimum-ambiguous-match-size=10 was omitted.

So the best commit so far is at a2e43ce2b8bc293fd8b166e289fd1be853b75c40 - this is the merge of Cartesian product splitting from match validation with overlap testing being hoisted in front of the Cartesian product.

Omitting --minimum-ambiguous-match-size=10:

That gives 1 minute 36 seconds, 1 minute 33 seconds, 1 minute 34 seconds, 1 minute 34 seconds, 1 minute 33 seconds, 1 minute 33 seconds.

The merge of all three approaches gave 1 minute 55 seconds, 1 minute 57 seconds, 1 minute 56 seconds, 1 minute 55 seconds and 1 minute 53 seconds.

Using --minimum-ambiguous-match-size=10 gave 39 seconds and 41 seconds for the best commit, and 43 seconds for the merge of all three approaches.

Of note is that the use of --minimum-ambiguous-match-size=10 gives a slightly better merge - including ambiguous matches results in stray asterisks being orphaned in the class CasesLimitStratgey, this was observed in ticket #26. Preventing ambiguous merges yields a perfect merge.

On that basis it's worth making the default for --minimum-ambiguous-match-size into a positive value, let's think about this...

Messing around --minimum-ambiguous-match-size:

Set to 4, we get a slightly flawed merge, the same as when all ambiguous matches are allowed.

Timings are 1 minute 38 seconds, 1 minute 34 seconds, 1 minute 39 seconds and 1 minute 34 seconds.

Set to 10, we get a good merge.

Timings are 47 seconds, 40 seconds, 40 seconds and 41 seconds.

Set to 20, we get a good merge.

Timings are 39 seconds, 36 seconds, 38 seconds and 40 seconds.

There is a slight performance benefit with the increase in threshold from 10 to 20, but it doesn't justify the potential loss of accuracy in the merge, especially as in the future we may want to support divergent move / collision resolution.

So --minimum-ambiguous-match-size defaults to 10.

Going to close this so that a release can be made.

Further work on performance belongs to #35.

Delivered in Git commit SHA: 503533fe26d18e2deca176f67971c11b44057a22.

sageserpent-open / kineticMerge

Improve performance for inter-file code motion analysis. #34