sageserpent-open / kineticMerge

Merge a heavily refactored codebase and stay sane.
MIT License
10 stars 2 forks source link

Orphaned text at bottom of the merge. #23

Closed sageserpent-open closed 8 months ago

sageserpent-open commented 8 months ago

Exhibit A - a change made on the left branch:

Screenshot 2024-01-10 at 16 09 13

Exhibit B - a change made on the right branch:

Screenshot 2024-01-10 at 16 10 31

So the renaming of an identifier referenced within a test method in A versus the renaming of the test method itself in B.

sageserpent-open commented 8 months ago

Exhibit C - the merge:

Screenshot 2024-01-10 at 16 13 32

Note that the referenced identifier wasn't renamed. What happened to it?

Exhibit D - the left form of the partially merged file has an orphan:

Screenshot 2024-01-10 at 16 16 44

NOTE: the left and right panes of the merge triptych are the partially merged sides of the conflict, not the versions of the file as per the left and right changes.

sageserpent-open commented 8 months ago

Using git merge yields the expected merge:

Screenshot 2024-01-10 at 16 24 58

This is a diff of the file on the left branch before and after the diff, so the referenced identifier change is already on both sides.

sageserpent-open commented 8 months ago

Analysis: looking at the sections and matches in CodeMotionAnalysis (btw, should these be logged separately from the mandatory console output via SLF4J?), what happens is that both changes are of a single token, but the one on the left comes later in the file than the one on the right.

This leads to an ambiguity - CodeMotionAnalysis can pick a slightly longer base + left pairwise match to start with, then the change on the left, then a slightly shorter base + left pairwise match, or it can pick a slightly shorter base + right pairwise match to start with, then the change on the right, then a slightly longer base + right pairwise match.

What it won't do is try to break down the matches further into all-side matches with two separate edits, as the all-sides matches are shorter than and enclosed within the pairwise matches.

The pairwise matches overlap, however CodeMotionAnalysis sensibly picks one alternative and goes with that - it is the two base and left pairwise matches, with a trivial section of just whenever between them in the base and whatever between them on the left. These suppress all attempts at breaking down the right hand side, so that it just one big section...

... so, we now have a right edit of the base up to the renaming of the referenced identifier into a single section that covers the entire right file input, followed by a left edit / right deletion conflict - one way, whenever is edit into whatever, the other way, it is outright deleted. Lastly, there is a right deletion of the remainder of the base.

sageserpent-open commented 8 months ago

My feeling is that merge is doing the right thing, but with very unsuitable inputs furnished by CodeMotionAnalysis. Intuition tells me that the breakdown into matches should have been much finer, going for three lots of all-side matches with the change from the left branch as a base + right match leading to a single-token left edit and the change from the right branch as a base + left match leading to a single token right edit.

What stopped this was the (expected) logic in CodeMotionAnalysis that prefers a longer pairwise match to a shorter all-sides match subsumed within it. That makes perfect sense when there is a long pairwise match in isolation that subsumes an all-sides match, eg:

BASE:    zzzzSAFEghghghghghghghghghgh
LEFT:    zzzzSAFEghghghghghghghghghgh
RIGHT:   zzzzINTRUDERghghghghghghghghghgh

NOTE: stuff in upper case denotes a single token - eg. SAFE is one token, whereas stuff in lower case denotes sequences of tokens.

It makes no difference whether we regard this as base + left versus a single section for the right, or break it down into two lots of all-sides matches with an edit sandwiched between them. The fundamental change is in just one place, on the right.

Suppose there are the two changes that lie on either side:

BASE:    zzzzSAFEghghghghghFIZZYhahahaha
LEFT:    zzzzSAFEghghghghghBANGhahahaha
RIGHT:   zzzzINTRUDERghghghghghFIZZYhahahaha

We can choose zzzzSAFEghghghghgh as base + left, plus FIZZYhahahaha as base + right, with edits to zzzzINTRUDERghghghghgh on the right and BANGhahahaha on the left. That makes a clean merge of zzzzINTRUDERghghghghghBANGhahahaha.

Why didn't this happen?

sageserpent-open commented 8 months ago

There is some leading and trailing context prior to changes mentioned above in the Java merge example:

BASE:    oooSAFEzzzzSAFEghghghghghFIZZYhahahahaSAFEyyyyyy
LEFT:    oooSAFEzzzzSAFEghghghghghBANGhahahahaSAFEyyyyyy
RIGHT:   oooINTRUDERzzzzINTRUDERghghghghghFIZZYhahahahaINTRUDERyyyyyy

This is because the right file had a global search-and-replace applied to it, so there are several edits lying on either side of the change made on the left. How does this affect things?

We get oooSAFEzzzzSAFEghghghghgh as base + left, FIZZY as base + right, then hahahahaSAFEyyyyyy as base + left. So in theory that would also lead to a nice merge of oooINTRUDERzzzzINTRUDERghghghghghBANGhahahahaINTRUDERyyyyyy.

This didn't happen - it looks as if FIZZY was treated as being different between the base and the right …

sageserpent-open commented 8 months ago

Um, the clue is in the match threshold - which was 20%, leading to:

minimumFileSizeAcrossAllFilesOverAllSides = 5076
minimumWindowSizeAcrossAllFilesOverAllSides = 1015
maximumFileSizeAcrossAllFilesOverAllSides = 5076
minimumSureFireWindowSizeAcrossAllFilesOverAllSides = 1015

It's hardly surprising that the putative one-token base + right match for token FIZZY didn't take place!

sageserpent-open commented 8 months ago

Even worse, being ham-fisted and setting the match threshold to zero to match everything results in overlapping sections. Section overlaps are suppressed if a section is deemed to be in the small-fry below the minimum sure file size, but that won't happen in this case as the sure-fire size is the threshold size for each file being merged.

sageserpent-open commented 8 months ago

Let's do a thought experiment to see what would have happened if smaller all-sides matches had got in on the act...

BASE:    oooSAFEzzzzSAFEghghghghghFIZZYhahahahaSAFEyyyyyy
LEFT:    oooSAFEzzzzSAFEghghghghghBANGhahahahaSAFEyyyyyy
RIGHT:   oooINTRUDERzzzzINTRUDERghghghghghFIZZYhahahahaINTRUDERyyyyyy

We'll assume the match threshold allows all the all-sides matches, but bars single-token matches for SAFE, INTRUDER, FIZZY and BANG.

That would lead to an all-sides match of ooo, a conflicting edit (really!), an all-sides match of zzzz another conflicting edit (equally dubious), an all-sides match of ghghghghgh, another conflicting edit, an all-sides match of hahahaha, yet another conflicting edit and final all-sides match of yyyyyy.

Now if weren't for the obviously daft conflicting edits - they are really two right-edits, a left-edit and a final right-edit, then all would be well.

sageserpent-open commented 8 months ago

A tentative plan:

sageserpent-open commented 8 months ago

Let's think about what happens when a single large pairwise match encloses several smaller all-sides matches, some of which may be ambigious matches, possibly even with each other. Bear in mind that the section location on the side of such an all-sides match that isn't enclosed (there must be one, obviously) may have no obvious connection to the pairwise match.

Examples:

BASE:  xxxxyyyyxxxx
LEFT:  xxxxyyyyxxxx
RIGHT: xxxxioeruuwoexxxx

So we have a larger pairwise match of xxxxyyyyxxxx between the base and left, and various ambiguous matches involving xxxx. It is not really clear that either of the two xxxx sections on the right have anything to do with the pairwise section and whether they should be thought of as being aligned. All we can hope for is to say they don't overlap, and thus the merge algorithm may decide to interpret these as preservations, doing a right-edit of yyyy into ioeruuwoe. That would give a merge of xxxxioeruuwoexxxx, assuming that yyyy is treated as a smaller base and left pairwise match - or that the final merge treats each y is being equal tokens.

That seems clear enough, but how do we let the ambiguous all-sides matches eat into the pairwise match?

This approach is obvious:

BASE:  xxxxyyyyxxxx
        |       |
LEFT:  xxxxyyyyxxxx
        |          \
RIGHT: xxxxioeruuwoexxxx

So we get a reduced pairwise match of yyyy on the base and left.

What happens if when we look at matches that cross-over the central part?

Hmm, given that all the ambiguous all-sides matches for a given window size are found in one go, that means that if a pairwise match's sections are simply eaten into on each side from all of the ambiguous matches, then the only thing remaining must be what made the larger match pairwise and not all-sides - so it is still valid as a pairwise match or matches.

In this example, we could eat into the base and left sides of the pairwise match in any order and still end up with yyyy on both sides of the match.

Where it gets tricky is something like this:

BASE:  xxxxyyyyxxxxzzzz
LEFT:  xxxxyyyyxxxxzzzz
RIGHT: xxxxioeruuwoexxxxeoierp

Eating into xxxxyyyyxxxxzzzz should yield two pairwise matches for yyyy and zzzz, but how does this emerge if sections are being eaten from arbitrary ambiguous matches?

sageserpent-open commented 8 months ago

Two approaches to eating away from pairwise sections:

sageserpent-open commented 8 months ago

There is a third approach that I'm not keen on - this is to remove pairwise matching in CodeMotionAnalysis and put it all on the merge algorithm to infer left-edits, left-deletions, right-edit and right-deletions. What I don't like about this is that it will stop Kinetic Merge from recognising edited moves further down the line - but that's the whole point of this exercise in the long run!

sageserpent-open commented 8 months ago

Added a bug reproduction test in Git commit SHA 3c69719c63c72085ddee215ac8205817338c7d24.

Had to edit the repeated characters in the example above to real words to avoid overlapping sections - not sure why this was necessary, need to revisit this.

sageserpent-open commented 8 months ago

The overlapping sections problem was down to having two overlapping matches (a base + left and a base + right overlapping on the base sections) that have exactly the same number of tokens, thus the same size. CodeMotionAnalysis does not suppress overlapping sections at the same size, indeed it expects them when a candidate match is a substring of a larger optimal match.

Perhaps this should be revisited later, but for now let's park it and concentrate on this ticket.

sageserpent-open commented 8 months ago

NOTE: remember the comment here: https://github.com/sageserpent-open/kineticMerge/blob/b325f61f8c17d98ec45a487cda3ebe9a526aceef/src/main/scala/com/sageserpent/kineticmerge/core/CodeMotionAnalysis.scala#L78.

Should an empty file be represented by the absence of any sections or a single section of length zero? The former is unambiguous, whereas the latter allows an arbitrary number of zero-width sections to be packed in - in fact, that can even be the case for non-empty files.

Let's add an invariant somewhere that forces a section to be non-empty.

sageserpent-open commented 8 months ago

As of Git commit SHA c487896f9aac4266b6a0f420d1ce5a951848dca9, there is something tantalisingly close to closing this ticket. The bug reproduction test is passing, but after a fair bit of overhauling of CodeMotionAnalysis there are now legitimate section overlap exceptions being thrown by CodeMotionAnalysisExtensionTest when it tries to merge the example SBT sources.

The nub of the problem is the overlap between:

lazy val javaVersion = "14"

ThisBuild / version := "0.1.0-SNAPSHOT"

ThisBuild / scalaVersion := "3.3.0"

ThisBuild / javacOptions ++= Seq("-source", javaVersion, "-target", javaVersion)

lazy val packageExecutable =
  taskKey[String]("Package an executable with Coursier")

lazy val 

and

lazy val root = (project in file("."))
  .settings(
    publishTo              := sonatypePublishToBundle.value,
    pomIncludeRepository   := { _ => false },
    sonatypeCredentialHost := "s01.oss.sonatype.org",
    publishMavenStyle      := true,

On the left hand side there is a stanza:

lazy val packageExecutable =
  taskKey[String]("Package an executable with Coursier")

lazy val versionResource =
  settingKey[File]("Location of generated version resource file.")

lazy val root = (project in file("."))

The first match intrudes partway into this stanza, then there is:

         versionResource =
  settingKey[File]("Location of generated version resource file.")

, followed by the start of the second match.

For the base and right files the two alternative matches overlap on the lazy val part!

sageserpent-open commented 8 months ago

The story so far as of Git commit SHA e3728d861d11dcb371c5c73afc4bcc5ab528fcb5...

  1. All-sides matches are now allowed to be subsumed by larger pairwise matches when searching for matches; once all matches have been found, the subsuming pairwise matches are eaten into by the subsumed all-sides matches to yield a mixture of adjacent pairwise and all-sides matches.

  2. When looking for the largest common subsequence during three-way merging, sections are now checked for equality by virtue of being in the same match or else having the same content. This picks up the smaller sections that don't make the grade when looking for matching sections, but are actually the same across the sides. It is open to debate as to whether this can be dropped if matching is allowed to proceed right down to single elements (tokens) or should be kept as a way of justifying not looking for mono/di/tri-graph section matches (which cause a lot of expensive noise when matching).

  3. Seeing some odd looking merge results in CodeMotionAnalysisExtensionTest.merging when setting the threshold right down to below 0.006 (0.6 %). Presumably we are down to single-token matches here, perhaps that is the reason?

  4. CodeMotionAnalysisTest.matchingSectionsAreFound is once again creating test cases with incorrect expectations - once again, there are coincidental match opportunities it doesn't know about; these render the test expectations invalid. Grr.

Point number three could perhaps be dealt with by banning mono/di/tri-graph matches? We do have the fallback content comparison when looking for the longest common subsequence, after all...

So it seems like we're getting there. Slowly. Diligently. Painstakingly.

sageserpent-open commented 8 months ago

As of Git commit SHA 9da79b3728ba4b0d3e8f76ba9ef1b16d4c93a7ac, have corrected test case generation in CodeMotionAnalysisTest.matchingSectionsAreFound, but have exposed a bug in CodeMotionAnalysis where an assumption about the presence of sections in all sides from an implied all-sides match isn't always valid when eating into a pairwise match.

What can happen is that a pairwise match subsumes a section that is in theory ambiguously matched on all-sides, but only one of the ambiguous matches completely subsumes the pairwise section. If said ambiguous match turns out to be blocked due to the threshold being too high in its file, then the eating blithely chomps on the section on one side of the pairwise match but doesn't have the corresponding section on the other side (because its match was suppressed).

Need to either make the eating of pairwise sections more robust - preferably by working at match level rather than section level, or go back to the idea of detecting all-sides matches first for all window sizes, then doing a second pass for pairwise matches, using subsumed all-sides matches to block subsuming pairwise matches in addition to the usual checks for being subsumed or overlapped.

Food for thought ...

sageserpent-open commented 8 months ago

Git commit SHA f9395e1b8a7467f4777e64f08dfd38eb0f8898af implements a revised and robust way of eating into pairwise sections; along with some major rework of MatchCalculationState and SectionsSeenAcrossSides - these have now merged into a simplified MatchesAndTheirSections, which holds the matches and their sections.

Multiple ambiguous matches may reference the same section on a given side; this is robust against ambiguous pairwise matches where one pairwise match is eaten into by an all-side match but the other isn't - the implementation will generate an overlap between the full section for the pairwise match that isn't eaten into and the smaller sections for the one that is.

The bug described in the previous comment has also been fixed, and CodeMotionAnalysisTest.matchingSectionsAreFound is back to passing with just a few rejected test cases with overlapping match section. Great.

What remains is a couple of failures in MainTest and CodeMotionAnalysisExtensionTest that seem to be down to overlapping sections. These need reviewing - if they are legitimate responses due to ambiguous matches, then either:

  1. the tests should be tweaked so as not to provoke them, if this is because of trivial overlaps from di- and tr-graphs or
  2. overlapping matches should not be permitted at small window sizes, especially if these are below the sure-fire match size.

There is also a long running trial in CodeMotionAnalysisTest.matchingSectionsAreFound - case 56 takes 4 minutes 33 seconds or so to come up with a rejected trial due what looks like a failed suppression of a digraph section...

sageserpent-open commented 8 months ago

All of the tree trial rejections seen in CodeMotionAnalysisTest.matchingSectionsAreFound are due to subsumed sections slipping through the net and being picked up as overlaps.

So are the failures in the other tests. So it's close to being in the bag...

sageserpent-open commented 8 months ago

Tests are passing in Git SHA 34ba3bc0884c2eb8bb23416c3e9e20d5047a74f9!

CodeMotionAnalysisExtensionTest is showing some odd-looking merge results at very low match size thresholds.

This might be due to eed679232a983ff491ac01e72a3fa9a41b6ffd44, which banned monograph matches...

sageserpent-open commented 8 months ago

Spoke too soon - the very last trial in CodeMotionAnalysisTest.matchingSectionsAreFound - the long running one - fails. Grrr.

sageserpent-open commented 8 months ago

Also tried reinstating monograph matches - this neither fixed the failing long running trial , nor did it make the odd merges nicer.

sageserpent-open commented 8 months ago

Finally, on this most auspicuous of days, we have Git commit SHA 4c8390f77cc73e0ce0b5959d0913b597cc1fd8f7.

Behold, all the tests pass!

sageserpent-open commented 8 months ago

Some subsequent housekeeping work done as of Git commit SHA 7505c922bb3d00de9c2913346b57c21af1c42966.

Now to revisit the original example....

sageserpent-open commented 8 months ago

Merging with match-threshold=0.005 yields this as a final diff:

Screenshot 2024-02-01 at 12 10 07

The merge triptych again:

Screenshot 2024-02-01 at 12 13 23

This time the test is renamed on the left and the identifier is renamed on the right.

sageserpent-open commented 8 months ago

Let's try again with the branches other way around, the same way as for the original example...

The diff:

Screenshot 2024-02-01 at 12 26 24

The merge triptych:

Screenshot 2024-02-01 at 12 28 43

sageserpent-open commented 8 months ago

So it looks good. Now, using something like --match-threshold=0.01 yields varying merges that are both poor, but don't feature orphaned test. However, it looks like the problem for this ticket is resolved. Those poor merges can be dealt with in their own ticket...

sageserpent-open commented 8 months ago

This went out in release 0.1.10, Git commit SHA 64fd5448f2f9e6d36953c86345d069d62d1d3a2f.