sageserpent-open / kineticMerge

Merge a heavily refactored codebase and stay sane.

Weird merge results. #19

Closed - sageserpent-open closed this issue 5 months ago

sageserpent-open commented 9 months ago

Doing some dogfood testing of Kinetic Merge on its own repository yielded this peculiar (and incorrect) successful merge result:

[Screenshot 2023-10-10 at 13:36:52]

(The left-hand side is the base, the middle is what Git's merge produced, and the right-hand side is what Kinetic Merge produced.)

Doing the same with Git's merge produced a conflicted merge, which to be honest seems a better result.

Need to analyse this and decide:

  1. Is this a logical consequence of the underlying merge approach at fine granularity? So it works 'correctly', but doesn't produce reasonable results based on a human's expectations...
  2. ... or is the merge approach actually broken? So it's not tested properly and has a bug in the implementation.
  3. What should have been the desired result?
sageserpent-open commented 9 months ago

Need some example integration tests; we already have ProseExamples, but we need some more realistic examples where code is merged.

sageserpent-open commented 8 months ago

Here's what the merge is doing...

Explanation (leading indents omitted, as they do not affect the overall argument):

BASE:

val _ = publishLocal.value; (rabinFingerprint / publishLocal).value

LEFT:

val packagingVersion = (ThisBuild / version).value

println(s"Packaging executable with version: $packagingVersion")

RIGHT:

val _ = publishLocal.value

Let's give the fragments labels:

Label  Fragment
A      val
B      _
B'     packagingVersion
C      =
D      publishLocal
E      .value;
F      (
G      rabinFingerprint
G'     ThisBuild
H      /
I      publishLocal
I'     version
J      )
K      .value
L      println(s"Packaging executable with version: $packagingVersion")

There are some simplifications here: both E and K are actually runs of tokens, but the principle is the same when applied to the individual tokens. Also, any whitespace is folded into the preceding token as an optional trailer; this isn't shown in the table.
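
To make the whitespace-folding concrete, here is a minimal sketch of a tokeniser in that spirit - purely illustrative, not Kinetic Merge's actual tokeniser: word-like runs come out as whole tokens, any other non-whitespace character stands alone, and any immediately following whitespace is folded into the preceding token as a trailer.

// Illustrative only: words (\w+) or single non-whitespace characters, each with any
// immediately following whitespace folded in as an optional trailer.
def toyTokens(content: String): Vector[String] =
  raw"(\w+|\S)\s*".r.findAllIn(content).toVector

@main def showTokens(): Unit =
  toyTokens("val _ = publishLocal.value; (rabinFingerprint / publishLocal).value")
    .foreach(token => println(s"'$token'"))
  // Prints 'val ', '_ ', '= ', 'publishLocal', '.', 'value', '; ', '(',
  // 'rabinFingerprint ', '/ ', 'publishLocal', ')', '.', 'value' - so `(`, `/` and `)`
  // come out as single tokens, while `.value` is a run of two tokens.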

sageserpent-open commented 8 months ago

Carrying on, what we have is a merge involving the following sequences of labels:

Side   Sequence
BASE   ABCDEFGHIJK
LEFT   AB'CFG'HI'JKL
RIGHT  ABCDK

A, C and K match on all three sides and thus act as anchors for the merge. B and D match on the base and right; they are anchors. F, H and J match on the base and left; they are anchors too.

The anchors organize the merge:

A goes through unchanged.

B is considered to be edited to B' on the left, so that goes through as an edit.

C goes through unchanged.

D is considered to be deleted on the left, so does not go through.

E only appears on the base, so does not go through.

F is considered to be deleted on the right, so does not go through.

G is surrounded by the same anchors as G', but does not appear unchanged on either the left or right. It is considered to be deleted coincidentally, and G' is inserted, rather than as an edit conflicting with a deletion.

H is considered to be deleted on the right, so does not go through.

I is surrounded by the same anchors as I', but does not appear unchanged on either the left or right. It is considered to be deleted coincidentally, and I' is inserted, rather than as an edit conflicting with a deletion.

J is considered to be deleted on the right, so does not go through.

K goes through unchanged.

L only appears on the left, so is considered to be an insertion.

This results in: AB'CG'I'KL

sageserpent-open commented 8 months ago

Substituting back, we get:

val packagingVersion = ThisBuild version.value

println(s"Packaging executable with version: $packagingVersion"

That's what Kinetic Merge produces.

sageserpent-open commented 8 months ago

Discussion

Would a different sequence of matches have made a difference?

Suppose that instead of matching the rightmost .value across the base, left and right, we took the leftmost - that would align:

BASE:

val _ = publishLocal.value (etc)

LEFT:

val packagingVersion = (ThisBuild / version).value

RIGHT:

val _ = publishLocal.value

That would lead to a nice merge where the _ would be edited to packagingVersion, the publishLocal would be edited to (ThisBuild / version) and the etc would be dropped.

Why didn't this happen?

Why were some of the changes treated as standalone insertions, rather than as edits in conflict with deletions?

This comes from an early implementation decision that takes code motion into account. The merge algorithm in use here is applied at the token level, but is intended to be applied to sections in the future - these are potentially quite big blocks of code that are either unique across the entire codebase as seen from all three sides, or are matched across sides.

The thinking was that if an entire block of code is preserved on one branch, but disappears or is replaced by a new block of code on the other branch, then this is a clear sign of a deletion or an edit on the other branch. If a block of code is dropped on both branches, but just one substitutes a new block of code, this should be interpreted as both branches getting rid of the block and one adding in something new - in effect, a coincident deletion plus an insertion on one side. So the insertion is not considered to be an edit in conflict with a deletion.

Let's pretend we did consider G' and I' to be edits on the left with conflicting deletions on the right ... we'd end up with this:

LEFT PARTIAL MERGE:

val packagingVersion = ThisBuild version.value

println(s"Packaging executable with version: $packagingVersion"

RIGHT PARTIAL MERGE:

val packagingVersion = .value

println(s"Packaging executable with version: $packagingVersion"

That's no improvement at all.

sageserpent-open commented 8 months ago

Revisiting the three questions at the start, we can say that for Q1, the answer is 'yes', and for Q2, 'no'.

The desired result would be the first one shown in the discussion above, which turns out to be a semantically correct merge in this case.

The problem is that the merge algorithm has no idea of the semantics: when it looks for anchors, it tries to maximise the number of matches across all three sides first, and if there is an ambiguity, it will then maximise the number of additional matches between the base and either the left or the right as a tiebreaker - and that is what led to its preferring to match the rightmost .value, because it could bag the preceding additional matches of (, / and ) into the bargain.

Hmm...

The current behaviour should work well at the section level because we really want to track matching sections; because they are large, we don't expect to see the sort of pathological ambiguity that manifests here.

In the meantime, a workaround for the token-level merge might be to prevent isolated anchors from being set up - so whereas .value consists of two tokens and is fair game for being an anchor, (, / and ) are just single tokens and are therefore rejected as potential anchors. That might lead to a better merge, regardless of whether .value is matched leftmost or rightmost.
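
As a toy illustration of that tie-breaking, assuming the two-component size of full matches versus partial matches is compared lexicographically (the real LongestCommonSubsequence bookkeeping is more involved, and the counts below are illustrative only):

@main def tieBreakIllustration(): Unit =
  // Full matches are compared first; partial matches are only consulted as a tiebreaker.
  val sizeOrdering = summon[Ordering[(Int, Int)]]

  // Both rival alignments of `.value` carry three full matches (A, C and the matched
  // `.value`), but the rightmost alignment also bags `(`, `/` and `)` as extra partial
  // matches, so it wins the lexicographic tiebreak.
  val leftmostDotValueAlignment  = (3, 2)
  val rightmostDotValueAlignment = (3, 5)

  assert(sizeOrdering.gt(rightmostDotValueAlignment, leftmostDotValueAlignment))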

sageserpent-open commented 8 months ago

A quick spike was done that penalises the partial matches when tie-breaking (at the expense of breaking LongestCommonSubsequenceTest).

Change from diff:

***************
*** 98,104 ****
        .modify(1 + _)

    def size: (Int, Int) =
!     commonSubsequenceSize -> (commonToLeftAndRightOnlySize + commonToBaseAndLeftOnlySize + commonToBaseAndRightOnlySize)
  end LongestCommonSubsequence

  object LongestCommonSubsequence:
--- 98,104 ----
        .modify(1 + _)

    def size: (Int, Int) =
!     commonSubsequenceSize -> (commonToLeftAndRightOnlySize - (commonToBaseAndLeftOnlySize + commonToBaseAndRightOnlySize))
  end LongestCommonSubsequence

  object LongestCommonSubsequence:

This penalises partial matches from the base with one of the left or the right, but still encourages partial matches between just the left and right (penalising those as well results in another poor merge).

The end result was a conflict, not the desired result, but getting there (and better than Git's merge result):

<<<<<<< fix-weird-merge_results
      val packagingVersion = (ThisBuild / version).value
=======
      val _ = publishLocalThisBuild / version).value
>>>>>>> brutallyInlinedRabinFingerprintSources

      println(s"Packaging executable with version: $packagingVersion")

That is encouraging, but I'm concerned as to whether throwing out base/left and base/right partial matching is generally a good idea - it worked nicely here, but we do want to recognise edits and deletions performed on either the left or the right. The probable saving grace is that these partial matches are only penalised in the context of being a tiebreaker when rival common subsequences share the same number of full matches - so only when there is a choice between choosing one full match versus another does this kick in.

It's worth taking a further look...

sageserpent-open commented 8 months ago

Tried generalising LongestCommonSubsequence to yield alternative results; this involved pruning a plethora of alternatives that share the same common subsequence but differ in strange and wonderful ways involving partial matches and outright differences. The merge is mapped over these alternatives and the longest merge is chosen.

Despite umpteen choices of size metric for LongestCommonSubsequence and thinning criteria for the alternatives that shared the same size, I still haven't seen a decent merge - the best so far is the same as the one in the previous comment.

So this approach won't work.

Perhaps this might work if combined with the idea of rejecting potential anchors that are isolated, need to think about this some more....

sageserpent-open commented 8 months ago

After a lot of experimentation with various schemes involving variations in size metric for the longest common subsequence, allowing alternate solutions (and ways of thinning them out), the conclusion is that they either don't work or cause an explosion in compute time for a practical merge.

One thing that wasn't tried was the notion of barring edits / deletions of isolated tokens, but this would have to be combined with alternate solutions anyhow, and that seems to be a non-starter.

sageserpent-open commented 8 months ago

Closing this as 'wont-fix'; time to concentrate on what this tool is supposed to deliver - merging in the presence of code motion, which means supporting the notion of sections, which should avoid these odd merges. All useful work done on the way will be merged into main, though, and the branch alternate-solutions-spike will be left around for a while in case it needs revisiting.

sageserpent-open commented 8 months ago

Reopening!

For work on merging in the presence of code motion to proceed, we need section support and the code analysis piece to be functional. If they exist, then it is feasible to use them to do a simple Git-style merge correctly, which is what this ticket is all about.

So this is now a vehicle for reintroducing sections and actually implementing CodeMotionAnalysis...

sageserpent-open commented 7 months ago

The story so far, as of Git SHA: d862da78c1028f097a8345d0eb9aec516b435f4f...

CodeMotionAnalysis and its retinue have been reinstated; these have been cut over to abstract over the content in a Section by making it a generic trait - instead of rendering content as String, it now yields IndexedSeq[Element], where Element is the type parameter. The existing test was cut over accordingly.

The plan is to cannibalize the use of Rabin Fingerprinting employed by PartitionedThreeWayTransform to allow matching sections to be detected in CodeMotionAnalysis. These sections will be optimised so that the matching sections cover as much as possible in the Sources.

A genetic algorithm will be used to perform the optimisation - a general-purpose implementation, Evolution, has been written to abstract over its chromosome and phenotype representations, and is tested with EvolutionTest against a model problem, namely sorting a sequence of long integers, with fitness based on how many pairs of adjacent elements in a given sequence are the wrong way around.
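
A minimal sketch of that model problem's fitness, assuming it is simply the count of adjacent pairs that are out of order (names here are illustrative, not EvolutionTest's actual code):

def adjacentPairsOutOfOrder(sequence: IndexedSeq[Long]): Int =
  // Pair each element with its successor and count the descents; a fully sorted
  // sequence therefore scores zero, which is the best possible fitness.
  sequence.zip(sequence.drop(1)).count((previous, next) => previous > next)

// adjacentPairsOutOfOrder(Vector(1L, 3L, 2L, 4L)) == 1
// adjacentPairsOutOfOrder(Vector(4L, 3L, 2L, 1L)) == 3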

After a lot of headscratching, EvolutionTest passes, but it is clear that the genetic algorithm requires configuration tweaking to suit the particular problem domain it is applied to - it's not going to be a case of dropping it in and waiting for the profit. One thing observed (see branch hammingWall) is that the choice of breeding (aka crossover) and mutation operations - which is tied in with the chromosome representation - has a big impact on whether the genetic algorithm terminates with a globally optimal solution, or just a pretty good second-class solution.

I've seen a lot of failing trials of EvolutionTest that yield a partially sorted sequence made out of two or sometimes three long runs of completely sorted elements that can't get past the Hamming Wall. Some experiments were made with changing the chromosome representation to allow different kinds of breeding, but these simply moved the onset of the Hamming Wall. In the end, a failed experiment with a new kind of mutation operator being applied to the original chromosome encoding was recast as a breeding operation, and that turned out to be very good at avoiding Hamming Walls.

However, turning up the dials on trials generation in EvolutionTest yields more Hamming Walls that require increasingly demanding configuration tweaks to get past them, so for now I'm taking the view that as long as the genetic algorithm can be shown to provide globally optimal solutions for 500 trials of moderate sequence size, and that its failures are pretty good solutions, then it should be good enough for section optimisation - we don't need a globally optimal section matching, just a good one.

sageserpent-open commented 7 months ago

Once the section matching and genetic algorithm have been wired up into CodeMotionAnalysis (and hopefully everything works first time!), PartitionedThreeWayTransform will probably be deprecated; while it is very neat and makes token-based merging practical, the expectation is that the number of sections generated will be small enough to allow merge to work directly with sections.

We'll see...

sageserpent-open commented 7 months ago

Chromosome representation: a sorted set of window sizes

Phenotype representation:

  1. The number of window sizes used to generate the phenotype.
  2. Pairs or triples of matching sections across the base, left and right sources, grouped by (section size, whether the match is a pair or a triple) so that the groups come in descending size order with pairs considered worse than triples.

Pairs whose section sizes are the same are grouped together regardless of whether they are base-left, base-right or left-right.

Where matches are ambiguous, all combinations are taken into the group. There is no notion of ordering of sources in a pair or triple - it is just the choice of sections from each side's sources that matters. So if two sections in the base sources match three sections in the left sources and just one section in the right sources, that yields six triples across all three sources.

Where a pair or triple refers to at least one section that is contained inside corresponding sections in a corresponding pair or triple for a longer window size, it is removed from the sequence as redundant. There is a subtlety here in that not all the sections involved in the pair or triple need to be contained in another match's section - this means that larger matches can trump potential smaller ambiguous matches, which never results in more ambiguity, but should often reduce or eliminate it.

Fitness is defined as the lexicographic ordering over the groups of pairs or triples, with the groups sorted in descending order of section size and pairs ranking below triples at the same size, and with a final tiebreaker being that the lowest number of window sizes wins.

So a larger group of matches at a given size and type is better, a group of triples trumps a group of pairs at the same size, a smaller group at a greater size trumps a larger group at a lesser size, and a more economical use of window sizes is the final tiebreaker.

NOTE: Given that the Rabin Fingerprinting code will find all the sections across the sources for a given window size, it can be argued that the size of a given group is redundant, because the size will never vary for a given section size and choice of pair versus triple. However, we will leave it in as a debugging aid, especially when there are ambiguous matches.
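
A toy sketch of how such a fitness might be compared, assuming the groups are already sorted in descending order of (section size, pair-versus-triple) when the phenotype is built; this is purely illustrative and not the actual Evolution code:

// Each group is summarised as (section size, is it a group of triples?, matches in group).
final case class ToyFitness(
    groups: Seq[(Int, Boolean, Int)],
    numberOfWindowSizes: Int
)

object ToyFitness:
  private val groupOrdering = summon[Ordering[(Int, Boolean, Int)]]

  // Larger sections trump smaller ones, triples trump pairs at the same size and bigger
  // groups trump smaller ones; ties fall through to the next group, and the final
  // tiebreaker prefers the chromosome that used fewer window sizes.
  given Ordering[ToyFitness] with
    def compare(first: ToyFitness, second: ToyFitness): Int =
      val byGroups = first.groups
        .zip(second.groups)
        .map((firstGroup, secondGroup) => groupOrdering.compare(firstGroup, secondGroup))
        .find(_ != 0)
        .getOrElse(first.groups.length.compare(second.groups.length))
      if byGroups != 0 then byGroups
      else second.numberOfWindowSizes.compare(first.numberOfWindowSizes)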

sageserpent-open commented 7 months ago

One mutation adds a new window size to the sorted set chromosome - choose a random integer from the minimum permissible (use the parameter minimumSizeFractionForMotionDetection passed to CodeMotionAnalysis.of) up to the maximum file size. This must be a size that wasn't already present in the chromosome; as a guard, we insist that we don't fill up all possible sizes.

Another mutation is to remove an existing window size.

Breeding consists of choosing a random partition value from either one of the chromosomes - one of the chromosomes is filtered to be <= the partition value, the other > and the filtered results are combined into a new sorted set.
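
A minimal sketch of those operations on a sorted-set chromosome; the bounds, random source and names are illustrative rather than the actual Evolution implementation:

import scala.collection.immutable.SortedSet
import scala.util.Random

type Chromosome = SortedSet[Int] // a sorted set of window sizes

def addWindowSizeMutation(
    chromosome: Chromosome,
    minimumWindowSize: Int,
    maximumWindowSize: Int,
    random: Random
): Chromosome =
  // Guard: only pick a size that isn't already present; if every permissible size has
  // been used up, leave the chromosome alone rather than looping forever.
  val candidates =
    (minimumWindowSize to maximumWindowSize).filterNot(chromosome.contains)
  if candidates.isEmpty then chromosome
  else chromosome + candidates(random.nextInt(candidates.size))

def removeWindowSizeMutation(chromosome: Chromosome, random: Random): Chromosome =
  if chromosome.isEmpty then chromosome
  else chromosome - chromosome.toVector(random.nextInt(chromosome.size))

def breed(first: Chromosome, second: Chromosome, random: Random): Chromosome =
  // Pick a partition value from one of the parents, then take the sizes at or below it
  // from one parent and those above it from the other.
  val donor = if random.nextBoolean() then first else second
  if donor.isEmpty then first ++ second
  else
    val partition = donor.toVector(random.nextInt(donor.size))
    first.filter(_ <= partition) ++ second.filter(_ > partition)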

sageserpent-open commented 7 months ago

While we're at it, should minimumSizeFractionForMotionDetection be changed to an actual size?

This seems a lot easier to think about, especially when the same section of code might live in a tiny file in one Sources and in a gigantic behemoth in another - so using the same fraction to decide whether to look for it in both sources seems a bit daft.

The only gotcha is that the 'size' is really a count of tokens, not of characters - and it should stay that way given the notion of whitespace insensitivity.

Let's punt on using it to determine the minimum window size globally across all the files in all the sources, but apply a filter when windowing within a file, using the fraction to make a threshold for that file in isolation. So a given small window size may be applied in small files but not in large ones.
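
A minimal sketch of that per-file filtering, with illustrative names rather than the actual CodeMotionAnalysis API:

def windowSizeAdmissibleInFile(
    minimumSizeFractionForMotionDetection: Double
)(windowSize: Int, fileSizeInTokens: Int): Boolean =
  // The fraction is applied to the file's own token count, so the threshold is decided
  // for that file in isolation.
  val thresholdForThisFile =
    (minimumSizeFractionForMotionDetection * fileSizeInTokens).ceil.toInt max 1
  windowSize >= thresholdForThisFile

// With a fraction of 0.1, a window of 10 tokens clears a 50-token file's threshold of 5,
// but falls below a 10,000-token file's threshold of 1,000 - so a given small window
// size is applied in small files but not in large ones.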

sageserpent-open commented 7 months ago

Just written my third comment in the code mentioning how uneasy I am about not modelling the ownership of a Section in its API. This means that client code has to keep some context as to what path is associated with a section; there is no way of asking a section for its owning File (which is also cut off from its associated path) or getting the associated path directly from the section.

Instead, the client code has to keep context by using maps from paths to files and thence to sections, or build inverse mappings from sections back to paths. For now I'm sticking with this approach as I don't like the consistency issues with having mapped objects containing keys; once again I want to see a quasi-map where a collection can be queried by a key type that is an intrinsic property of the contained elements.

sageserpent-open commented 7 months ago

Revised the API for Sources in f78a3b54af1ef2f0602731d820fdf9ada57f1539, which allays some of the above concerns.

sageserpent-open commented 7 months ago

A thought - did some profiling of Evolution.of when chugging through EvolutionTest.evolutionSelectsTheSortedSequence. Despite not having any significant IO interaction (unless Caffeine uses offline storage, which I don't think it does unless configured to do so), CPU usage hovers at only a little over 10%.

Is there scope for parallelising the chromosome breeding / mutation process, or evaluating the phenotypes?

sageserpent-open commented 7 months ago

Checked with the commit prior to introducing Caffeine, and that shows CPU usage at around 7%, so Caffeine is definitely not hindering the CPU utilisation, although it might require some for its own purposes, e.g. cache maintenance, or it allows other bits of code to make better use of the CPU. Either way, that's no bad thing.

sageserpent-open commented 7 months ago

Getting there - now seeing matches appear, but not all of them. More on that in a bit....

Had to rework the assertions in CodeMotionAnalysisTest.matchingSectionsAreFound, as a review prompted by a spate of test failures showed they didn't correctly capture the expected behaviour; there is also an outstanding review item on the generation of the threshold parameter minimumSizeFractionForMotionDetection, with a shelf for it.

Prior to cutting over the test assertions, I was seeing a mixture of test case rejections due to what turned out to be faulty assertions, and what seemed to be failures caused by overlapping sections being matched. The latter was ascribed to the test cases being too sloppy about preventing unanticipated matches in addition to the ones being asserted on; the hypothesis was that these accidental matches were causing ambiguous overlapping matches with each other and with the expected matches.

Since cutting over the test assertions, no such rejections have been observed, so it may be worth removing the test case rejection code put in subsequently when overlapping sections are detected!

The current test case failures fall into three categories:

Apart from that, there is still the question of new test failures once this crop has been dealt with, and the need for still more optimisation - the test cases are still restricted in size, need to lift that before declaring CodeMotionAnalysis ready.

Oh, and I've lost all confidence in CodeMotionAnalysisTest. It is not feasible to backpatch the changes made to the test to apply to a starting stub implementation of the SUT to confabulate a nice clean TDD flow, which is what I'd normally do for the sake of belt-and-braces. Instead, I'm going to resort to fault injection and coverage testing once the dust settles...

sageserpent-open commented 7 months ago

It turns out that I had ignored tests filtered out of view in IntelliJ. Doh. While shrinkage is proceeding there is only one ignored test case, so it was easy to miss - but temporarily disabling all assertions to keep the trials passing results in plenty of ignored test cases due to section overlapping.

A ham-fisted way of avoiding these would be to assign different element ranges to each sequence used to build the common and unique element runs in a test case, so there is no possibility of accidental matching between sequences. This loses some generality, so I'm going to hang fire for now on this...

sageserpent-open commented 7 months ago

After what seems like an eternity (17 days!), CodeMotionAnalysisTest.matchingSectionsAreFound passes - it's been a mixture of slowly building up internal scaffolding in the SUT, refactoring Sources, and debugging and optimisation in the SUT competing with changes to the test as I realised it had bugs both in its assertions and in its test case synthesis. Also discovered some interesting edge cases through testing that required new assertions and implementation work.

However, the tests pass!

There is a known bug in the test case synthesis where if there are no common sequences in the TestPlan instance, then it is possible for accidental matches to creep in with a low probability. So far this has only been seen right at the end of test case shrinkage, so I'm sweeping this under the carpet for now as it doesn't show up in a 'healthy' run.

What's next?

  1. Remove some potentially redundant pattern match cases when matching sections - these were put in when looking at pairwise matches subsuming all-sides matches, but I'm not sure they are necessary after all. Turns out these are necessary and their lack causes test failures.
  2. Rework the chromosome mutation code, and try to increase window size instead of decreasing - using the geometric mean of the largest window size and the maximum window size as a permissible size range seems a good idea. Should also use the slots abstraction from Americium to avoid hitting existing window sizes. DONE: results in a good performance boost.
  3. Actually breed chromosomes instead of just selecting one or the other. DONE: some performance improvement from this.
  4. Optimise - a lot. It takes a long time to construct a single Rabin Fingerprinting polynomial instance - perhaps these could be serialised and bundled as a resource? DONE ENOUGH FOR NOW
  5. Reinstate the original test plan limits - they have been cut down to get results in less than geological time. DONE
  6. Finesse the selection of threshold fractions for matching in the tests. DONE
  7. Reinstate the lookup between sections and their matches so that there is just one per CodeMotionAnalysis, instead of one for each side in the analysis. This split across sides was only done because of a hunch about sections from different sides being able to compare equal as keys - need to revisit this. DONE

sageserpent-open commented 6 months ago

Still working on the fourth and fifth items from the list above - optimising, switching the tests back up to large sizes and back.

It is now clear that the genetic algorithm plus Rabin Fingerprinting approach works for comparatively small test cases where matches involve up to 30 elements or so, but this is way below what Kinetic Merge will have to deal with in practice. Various tweaks have made it in, with some mild performance improvements, but turning up the expected common sequence sizes to around 1000 kills the performance completely - I'm not even sure if a correct result would even be found, as I've yet to see the algorithm terminate.

What has become glaringly apparent is that there is a natural flow to improving the solutions that the genetic approach doesn't capture:

  1. There is a longest match size that cannot be contradicted by subsequent additional match sizes, so once it has been found there is no point in throwing it away - and indeed because it can subsume other potential matches, it pays to keep it once we've found it.
  2. What often happens is that a shorter match is found first that would be subsumed in a larger match forming part of the optimal solution. When this happens, there are actually a volley of such shorter matches that overlap sequentially by one element, because the shorter matches cover successive parts of the as yet unknown larger match. This creates a vast amount of churn, but can hint at the presence of the larger match. Checking that matches overlap sequentially on all of their sides is expensive (and needs to take into account pairs versus triples), so in practice this is a case of counting the number of matches at a given size as a hint and no more, because some if not all of those matches might not overlap sequentially or even at all.
  3. Once the longest match has been found, it makes sense to back fill, so there is some merit in separating growth into an initial phase of looking for the longest match and then back filling once that has been achieved.

Hacking the evolution strategy and chromosome implementation in various obscure ways to acknowledge this flow doesn't yield convincing results though - the genetic algorithm is intrinsically too haphazard to observe that flow accurately enough to converge fast.

Another hit is the calculation of the Rabin Fingerprints - they take a long time, many seconds for large window sizes, and that really has an impact.

sageserpent-open commented 6 months ago

What to do?

  1. Admit partial defeat and change the chromosome representation over from window sizes, using Rabin Fingerprinting to compute the phenotype to incorporating putative matches directly in the chromosome, so the phenotype is evaluated by doing subsequence comparisons across the match's sides. Mutation would involve expanding, contracting or moving the boundary of a match in the chromosome, adding a new match or removing an existing one. Breeding would select matches from either of the parent chromosomes. Both mutation and breeding would incorporate a cleanup step to fuse adjacent matches or remove subsumed matches. The initial population would be bootstrapped by using Rabin Fingerprinting to find initial matches at various sizes.
  2. Admit partial defeat and throw out the genetic algorithm, replacing it with a systematic approach that looks for the longest match, then the next longest etc.. This fits the flow above, can use hinting to speed up the process of going up in match size to find the next tier of match and naturally deals with subsumption as each match search already has the optimal larger matches as context to weed out subsumed matches.
  3. Use some cheap-and-cheerful hashing instead of Rabin Fingerprinting - the code already has to do a final subsequence comparison to avoid false positives; maybe this would be much faster.

All these plans would reuse a significant amount of what's been written, so there's no need to start from scratch.

Probably a case of trying all of them out on alternate branches, although the cheap hash could be applied to either of the first two options as a tweak.

sageserpent-open commented 6 months ago

There is good news in option 3 above - cheap and cheerful hashing. Doing a like-for-like comparison of performance using a cut down form of CodeMotionAnalysisTest yields a total runtime of 7 minutes 45 seconds / 6 minutes 49 seconds / 6 minutes 29 seconds using Rabin fingerprinting, against 37 seconds / 32 seconds / 34 seconds using simple rolling polynomial hashing in the style of the Rabin-Karp algorithm. Result!

The bad news is that whereas Rabin fingerprinting is pretty robust against accidental fingerprint collisions, the rolling polynomial hash can be quite susceptible to collisions. The implementation of RollingHash has been tweaked to make it pass a collision resistance test, but the high prevalence of collisions in its initial implementation has exposed a latent bug in CodeMotionAnalysis - the logic for finding matches from fingerprints is not robust when a potential all-side match is rejected because one of the sides has a false positive fingerprint match - it rejects the all-side match, but should consider the possibility of a pairwise match between the two sides that don't have false positive fingerprint matches.

In fact, it is possible to have a mix of matches that would be false positives if contributions across sides cross over, but make perfectly consistent matches otherwise - so we have two distinct sequences, both of which are replicated across the sides and thus have matches - but both sequences have colliding fingerprints. We have to tease apart the correct matches and not try to attempt the false positives where the two sequences get mixed up in a putative match. The existing code fudges this, missing certain opportunities.
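
For reference, a minimal sketch of a rolling polynomial hash in the Rabin-Karp style; RollingHash in Kinetic Merge is engineered rather more carefully for collision resistance, so treat this purely as an illustration of the rolling mechanism:

final class ToyRollingHash(windowSize: Int):
  require(windowSize > 0)

  private val prime = 1000000007L // large prime modulus
  private val base  = 257L        // polynomial base

  // base^(windowSize - 1) mod prime, used to drop the outgoing element's contribution.
  private val highestPower: Long =
    (1 until windowSize).foldLeft(1L)((power, _) => power * base % prime)

  private val window = scala.collection.mutable.Queue.empty[Long]
  private var hash   = 0L

  def pushElement(elementHash: Int): Unit =
    val incoming = (elementHash.toLong % prime + prime) % prime
    if window.size == windowSize then
      val outgoing = window.dequeue()
      hash = (hash - outgoing * highestPower % prime + prime) % prime
    window.enqueue(incoming)
    hash = (hash * base + incoming) % prime

  // Only meaningful once a full window's worth of elements has been pushed.
  def isPrimed: Boolean = window.size == windowSize
  def fingerprint: Long = hash

The idea is to push each element's hash code as a window slides across a file; windows whose fingerprints agree across sides become candidate matches, which then still need a direct content comparison to weed out false positives.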

sageserpent-open commented 6 months ago

Testing with match lengths of up to 1000 elements yielded a run time of 1 hour 14 minutes for the first test case and 13 minutes for the second, using the simplified chromosome representation + rolling polynomial hash. Not exactly spectacular, but we're out of geological time and into span-of-human-race territory now.

sageserpent-open commented 6 months ago

Merging the rolling polynomial hash into the branch for inferring larger matches within the genetic algorithm yields a run time of 24 minutes for the first test case, 1 minute 57 seconds for the second. Both these test cases yielded slightly suboptimal solutions - some of the matches were just one element shy of the maximal match, thus formed overlapping pairs.

A couple of further test cases actually pass with optimal solutions - the fifth test case in 1 minute and 54 seconds and the seventh test case in 1 minute 8 seconds.

So this would support option 2 above.

sageserpent-open commented 6 months ago

Merging the rolling polynomial hash into the branch that goes for the largest match first and then backfills yields a run time of 1 hour 58 minutes for the first test case. This spent a long time faffing around on the backfilling, so I'm still of the opinion that option 2 is worth trying out.

sageserpent-open commented 6 months ago

Made the rolling polynomial hash reasonably collision resistant and have fixed the latent bug in CodeMotionAnalysis that was exposed by collisions in SHA: 1ed5794805a8825a731bafff78512ba920de50b5.

sageserpent-open commented 6 months ago

The story so far as of commit SHA b4ad6d1c9481e1a5011452c7962fd3b8382e000c: trying out the second plan and using a systematic approach yielded a considerable performance benefit; this will be used going forwards. The systematic approach doesn't work well for the end game where window sizes are below the minimum sure-fire window size, however - this is the situation where the threshold for considering a match in a specific file is smaller than the largest such threshold across all files (which is the sure-fire window size).

What happens there is that as window sizes decrease down to the minimum across all files, matches can disappear as they no longer clear the relevant threshold. Matches can also vanish as the window size increases when the optimal size of a match is breached, so overall one can observe gaps in window sizes that lead to matches when working below the sure-fire size.

The solution is to use systematic search to do the heavy lifting, efficiently finding all the optimal matches down to the sure-fire window size inclusive, then handing over to the genetic algorithm for the end-game down in the smallest window sizes.

Tracking the inclusion of sections within SectionsSeenAcrossSides (used to detect subsumption of candidate matching sections in larger matches) can take pathologically long when there are a huge number of small matches of size 3 down to 1 - these turn up in some of the nastier test cases where most of the files are large, but there are some small fry whose individual thresholds permit such small matches. There are two related problems - small matches of size > 1 generate lots of matches of digraphs that may or may not be ambiguous, and matches of size 1 churn out a flood of ambiguous matches. Doing the subsumption check involves checking a huge number of sections at each small window size against finger trees that are accumulating all of the slightly larger sections - that can result in doing work proportional to the product of the number of small sections at each small window size.

There is ongoing work to address this, some progress has been made but it may be prudent to enforce a global lower bound on window sizes to avoid this issue - do we really want to have one-element matches? It depends on how exotic the content of the match is...

One solution that has limited success is to condense overlapping sections together as they are added to SectionsSeenAcrossSides - this currently doesn't affect the matches, but hoovers up a lot of the small overlapping sections, replacing them with a much smaller number of disjoint sections.
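
A minimal sketch of that condensation, modelling sections as bare half-open token ranges (the real SectionsSeenAcrossSides tracks proper Section instances with finger trees):

def condense(sections: Seq[(Int, Int)]): Vector[(Int, Int)] =
  // Sort by start position, then sweep, fusing each range into the previous one
  // whenever they overlap or abut.
  sections.sorted.foldLeft(Vector.empty[(Int, Int)]) { (condensed, next) =>
    val (nextStart, nextOnePastEnd) = next
    condensed.lastOption match
      case Some((start, onePastEnd)) if nextStart <= onePastEnd =>
        condensed.init :+ (start -> (onePastEnd max nextOnePastEnd))
      case _ => condensed :+ next
  }

// condense(Vector(0 -> 5, 3 -> 8, 10 -> 12)) == Vector(0 -> 8, 10 -> 12)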

Another solution with limited success is to limit the number of ambiguous matches - given that at some point a human will have to manually resolve these, it seems pointless to burden them with more than a hard-wired limit of 5 for any given window size. There is no value in asking someone to manually resolve several hundred ambiguous matches of size 1!

Both of these strategies are in place and work well at least most of the time, however test case 22 of CodeMotionAnalysisTest.matchingSectionsAreFound is an example that currently holds out - it takes 37 minutes to execute for the tests as of the aforementioned commit SHA.

The rest of the test cases take from a couple of seconds to a minute or two; still room for further optimisation, but we're getting there now.

sageserpent-open commented 6 months ago

Computing matches for very large window sizes can possibly be skipped by using the condensation of sections added to SectionsSeenAcrossSides - if several sides have matching condensed sections, this can be reported to the systematic search, implementing the hint mechanism efficiently by going directly to an optimal match. Might be worth a shot once the dust has settled on the small sections issue.

sageserpent-open commented 6 months ago

Also, the code has been hacked a fair bit to implement systematic search - it needs cleaning up and more commentary.

sageserpent-open commented 6 months ago

Given that sections of size 1 cannot overlap each other, or subsume other sections, it seems pointless to add matches of size 1 into SectionsSeenAcrossSides.

sageserpent-open commented 6 months ago

For small matches it is likely that we will see lots of trivial ambiguous matches that are just for each single element value, or digraphs or trigraphs. Most of these will be suppressed by the 'interesting' matches of much larger size, but this creates noise in the files whose thresholds fall way below the sure-fire window size.

Now, if there is some important token that only occurs in a single place on a side - e.g. a string literal - then it's worth tracking its match, but otherwise we'll just see either a heap of ambiguous matches of size one for each element or lots of clumps of ambiguous matches for digraphs and trigraphs.

Perhaps the limit on ambiguous matches should be relaxed completely for file sizes at or above the sure-fire size, and set to precisely one for those that come below?

Update: the last idea worked well enough for test case 8 but made little difference for test case 22. In the end, a rather draconian solution of simply discarding all small fry matches as soon as the number of matches exceeded the sure-fire window size was adopted; ambiguous matches are no longer limited in their own right, only via the new approach. The logic is that if we have several matches at a small window size and these are due to overlapping partial matches, then the largest optimal match encompassing all these partial matches must have a size no greater than the sure-fire size. If there are more matches than that, they are treated as noise and discarded.

sageserpent-open commented 6 months ago

As of commit SHA: 9e6b051ba33db3a86dcfab3abed6ecc4616ec1a9, the genetic algorithm endgame is reasonably proof against pathological test cases that would involve lots of matches at window sizes of around 1 .. 4. A timeout is used to ensure that the genetic algorithm doesn't get stuck on umpteen retries when the aforementioned strategies have stopped it making further matches.

The dreaded test case 22 now takes 2 minutes and 29 seconds / 2 minutes 31 seconds / 2 minutes 39 seconds, of which in the last run, 1 minute 51 seconds was spent on systematic search, 25 seconds on the first window size evaluation at size one and 3 seconds on the final bunch of window size evaluations at 4, 2 and 3.

It's time to optimise the systematic search some more...

sageserpent-open commented 6 months ago

Incorporating optimal match size hints into systematic search yields good performance benefits, even with a crude approach.

For 63 test cases, including the dreaded cases 8 and 22...

BEFORE (864c8f881a10ce8a8e412cacbcbda7f2db86402f):

1 hour 2 minutes for all 63 test cases. 2 minutes 22 seconds for case 8. 2 minutes 37 seconds for case 22.

AFTER (e009d8ca2637bf1b59b763e2847157f01dbc5db6):

34 minutes for all 63 test cases. 1 minute 28 seconds for case 8. 1 minute 53 seconds for case 22.

sageserpent-open commented 6 months ago

As of commit SHA: 9ff24cd6453eb49d87a9f59c57edf644cde8f91d, I'm declaring CodeMotionAnalysis ready for integration into the rest of the Kinetic Merge application.

Now for the fun part...

sageserpent-open commented 6 months ago

It would be prudent to document how to reproduce the weird merge at this point!

Clone the Kinetic Merge repository into a throwaway working copy (use git clone --no-hardlinks if you're cloning from a repository on a local filesystem in case the merge wrecks the repository!)

git clone --no-hardlinks kineticMerge weirdMerge
cd weirdMerge

git tag BASE 0faea64a
git tag OURS fd3b0161
git tag THEIRS bee8a93d

git switch -c destination

Now we're on a branch destination.

Check with a Git merge:

 git merge --no-commit THEIRS

less build.sbt

This results in a conflict in file build.sbt at the top-level of the working tree of the repository:

    packageExecutable := {
<<<<<<< HEAD
      val packagingVersion = (ThisBuild / version).value

      println(s"Packaging executable with version: $packagingVersion")
=======
      val _ = publishLocal.value
>>>>>>> THEIRS

      val localArtifactCoordinates =
        s"${organization.value}:${name.value}_${scalaBinaryVersion.value}:$packagingVersion"

      val executablePath = s"${target.value}${Path.sep}${name.value}"

Excellent - now abort the Git merge...

git merge --abort
git status

The working tree should be clean now.

Now go back to the Kinetic Merge repository for development and use SBT to package the executable:

sbt packageExecutable

This writes the application to a directory under target/kinetic-merge; you should see this in the console output.

Switch to the throwaway working copy and use Kinetic Merge:


<top level Kinetic Merge development repository working tree directory>/target/kinetic-merge --help

<top level Kinetic Merge development repository working tree directory>/target/kinetic-merge --no-commit THEIRS

less build.sbt

This results in a clean merge; this time the previously conflicted part looks like:

    packageExecutable := {
      val packagingVersion = ThisBuild version.value

      println(s"Packaging executable with version: $packagingVersion")

      val localArtifactCoordinates =
        s"${organization.value}:${name.value}_${scalaBinaryVersion.value}:$packagingVersion"

      val executablePath = s"${target.value}${Path.sep}${name.value}"

      s"cs bootstrap --verbose --bat=true --scala-version ${scalaBinaryVersion.value} -f $localArtifactCoordinates -o $executablePath" !

      name.value
    },

(Added in some extra trailing context to match the screenshot from above).

Note the val packagingVersion = ThisBuild version.value - that's the weird part.

Don't forget to tidy up again afterwards...

git merge --abort
git status
sageserpent-open commented 6 months ago

What's needed to integrate CodeMotionAnalysis into the application?

  1. Need to restructure Main so that instead of working path-by-path, it uses CodeMotionAnalysis, then feeds the sections relevant to a path off for merging. CHANGE OF PLAN
  2. Cut over mergeTokens into mergeSections.
  3. Repurpose the helpers for token hashing and equality so that they can be used by CodeMotionAnalysis.
  4. Section equality has to be fed to merge by mergeSections.
  5. Cutover PartitionedThreeWayTransform to use the rolling polynomial hash. This is just a temporary step, as that class is scheduled for deletion, but allows the Rabin fingerprinting work to be removed separately as a clean, easily-revertible commit. DONE
  6. Remove the Rabin fingerprinting sub-project. DONE
  7. Delete PartitionedThreeWayTransform as a clean, easily revertible commit.
  8. Cut over MergeTokensTest to work with mergeSections.
  9. Need some kind of Sources implementation that works with real files. CHANGE OF PLAN
sageserpent-open commented 6 months ago

Looking at the first task in the comment above prompts me to think: "If there are 50 files on the base, left and right, and only two change, then should CodeMotionAnalysis look at all 50 files, or should it concentrate on just the changed, deleted and added ones?"

There is an argument that if a section in a changed, deleted or added file is matched in some incidental file that is the same on the base, left or right, then presumably that is an ambiguous match and needs to be flagged - or maybe the limited scope provides a nice way of avoiding the ambiguity? Hmm....

However, the job right now is to make the old path-by-path merge work nicely, so let's not cut over the entire logic of Main to support a global code motion analysis. Instead, we produce an instance of CodeMotionAnalysis for each path where the sources provided to the instance all work on the same shared path, and merge separately for each path.

The sources are constructed directly from the blob contents - these are tagged strings, so no need for anything fancy; the implementation is very similar to FakeSources.

sageserpent-open commented 6 months ago

No, changed my mind again. Looking at the code structure in Main, what happens is that a big swathe of file-level changes across the base, left and right commits are discovered using Git commands and are assembled into an overallChangesInvolvingTheirs: List[(Path, (Change, Option[Change]))], where Change denotes a file modification, addition or deletion, and carries the file content for the modification and addition.

What follows is a breakdown of these changes into conflicts at the path level (mimicking what Git does before it tries to merge content changes), mixed with merge opportunities when a path has a change that conflicts on at least one side.

If a merge on a path is trivial - say, a file has been added on one side (which is always the right; the left side is the destination working tree and thus already has its 'additional' files wrt the base) - then it is simply added to the Git index; there is no invocation of the content merging machinery.

If a path only exists on the left and right side, then an empty placeholder content is used for the base and the usual three-way merge is performed, this already understands how to perform a two-way merge where appropriate within content, so this is a degenerate case where the entire base content is empty.

So, a simple approach would be to limit our attention to just the changes in overallChangesInvolvingTheirs that involve changes from both the left and right sides (this will have to be generalised to include files added on just the left or right side when the full code motion merging across and within files is delivered). Each path that has changes on both sides is bundled up and fed to a single invocation of CodeMotionAnalysis.of to get a breakdown of sections in a single CodeMotionAnalysis instance.

We can then consult that instance for each per-path merge - so the helpers such as indexStateForTwoWayMerge and indexStateForThreeWayMerge are cut over to work with sections gleaned from the instance.

This means that the tokenisation performed in these helpers has to move upfront prior to the call to CodeMotionAnalysis.of.

The existing mergeTokens is then cut over to become mergeSections, which is still part of the plan above.

Not sure what to do with the test MergeTokensTest - probably just use CodeMotionAnalysis.of to process the single path and then cutover to call mergeSections.

sageserpent-open commented 6 months ago

TODO: go through commentary and decide whether to use side or sources!

sageserpent-open commented 6 months ago

Exhibit A - MergeTokensTest console output as of Git commit SHA 304b99d0222570cc581c2b969757ca4d4690a3ca:

"""
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of small fishing boats;
Astride the sea, beneath the quay,
Rocking and swaying in the breeze.

Why this allusion?
I Havant a clue!
(well, maybe not quite) I thought, 'Was this part of the job role?'.
'Should I be expected to deal with flowers?'
The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but thought only of
raising this in the next Zoom meeting.

For oft, when on my Aeron I slouch
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And sends an email to human resources.
"""
*********************************
"""
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of small fishing boats;
Astride the sea, beneath the quay,
Rocking and swaying in the breeze.

Why this allusion?
I Havant a clue!
(well, maybe not quite) I thought, 'Was this part of the job role?'.
'Should I be expected to deal with flowers?'
The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but thought only of
raising this in the next Zoom meeting.

For oft, when on my Aeron I slouch
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And sashays fishing boatsemail to human resources.
"""
sageserpent-open commented 6 months ago

Exhibit B - CodeMotionAnalysisExtensionTest console output as of Git commit SHA a24dc85be47212f126693f1cfba4da79ddf5d119:

"""I thought, 'Was this part of the job role?'.
'Should I be expected to deal with flowers?'
The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but thought only of
raising this in the next Zoom meeting.

For oft, when on my Aeron I slouch
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And sends an email to human resources.
"""
*********************************
"""
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of small fishing boats;
Astride the sea, beneath the quay,
Rocking and swaying in the breeze.

Why this allusion?
I Havant a clue!
Along the margin of a bay:
Ten thousand (well, maybe not quite) saw I at a glance,
Tossing their heads in sprightly dance.

The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but little thought
What wealth the show to me had brought:

For oft, when on my couch I lie
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And sashays with the fishing boats.
"""
sageserpent-open commented 6 months ago

The good news is that the right-hand version of the merged content looks quite good; the bad news is that the left-hand version seems to have lost a good chunk of the text. What happened?

sageserpent-open commented 6 months ago

Exhibit C - CodeMotionAnalysisExtensionTest console output as of Git commit SHA 8635abbf5740783db9c4003c519acb5fc72f4006:

"""I thought, 'Was this part of the job role?'.
'Should I be expected to deal with flowers?'
The waves beside them danced; but they
Out-did the sparkling waves in glee:
A poet could not but be gay,
In such a jocund company:
I gazed—and gazed—but thought only of
raising this in the next Zoom meeting.

For oft, when on my Aeron I slouch
In vacant or in pensive mood,
They flash upon that inward eye
Which is the bliss of solitude;
And then my heart with pleasure fills,
And sends an email to human resources.
"""
*********************************
"""
I wandered lonely as a cloud
That floats on high o'er vales and hills,
When all at once I saw a crowd,
A host, of small fishing boats;
Astride the sea, beneath the quay,
Rocking and swaying in the breeze.

Why this allusion?
I Havant a clue!
Along the margin of a bay:
Ten thousand (well, maybe not quite) sashays with the fishing boats.
"""
sageserpent-open commented 6 months ago

My interpretation is that this is down to the 'obvious' match of "I wandered lonely ... A host, of" that would have matched across all three sides being trumped by a longer match of "I wandered lonely ... swaying in the breeze.", which matches as a pair across the base and left.

This excludes the right's contribution to the putative all-sides match, so the merge algorithm considers this to be an edit of what is matched across base and left to what is on the right. Fair enough.

I'm not sure why that edit is only apparent on the right-hand version of the conflicted output though? Is there a bug in how the conflicted merge results were built, or does the merge algorithm genuinely think that the edit is in conflict with something new on the left?

sageserpent-open commented 6 months ago

Running out of 2023 right now, so perhaps a quick exploratory hack of Main is in order - can go through the painful analysis next year.

sageserpent-open commented 6 months ago

A crude hack to quickly integrate CodeMotionAnalysis exists on Git commit SHA c5c055a764f543d4142b53fa8fc3c5e7636e50de.

This breaks MainTest, so something is afoot...