Closed sageserpent-open closed 1 month ago
A rather horrible idea which I think has already been tried is to fragment unmatched text into single token sections, so that the merge can have a crack at aligning individual tokens.
I have a feeling that if this was tried in the past, it blew the merge algorithm up due to all those tokens - in real-world examples, such unmatched code can have many tokens.
It seems as if the merge should be able to explode unmatched sections and perform a recursive match-and-merge on aligned sections to do this. If it could, that would sidestep the small ambiguous match problem nicely too.
An alternative is to do something hokey like looking for common prefixes and suffixes when unmatched sections are aligned.
One problem is what is meant by aligned? - If the sections don't match and don't have identical content, then the merge algorithm will think there is some kind of editing going on...
Remember PartitionedThreeWayTransform
- perhaps that could be applied to all the unmatched sections in a last ditch attempt at finding small matches?
Either that, or reapply matching to just the unmatched sections - would need to generalise this to support non-contiguous sections…
Perhaps the merge algebra implementation should be tweaked to recognise this situation, then examining the insertion to see if it has a prefix or suffix that matches what would be thrown away from the base. That would allow the merge to rewrite sections on the fly.
That might require some fancy footwork to rebuild the breakdown of files into sections, though.
Let's examine this from the point of view of the merge algorithm, as the use of 'insertion' above is a bit sloppy - that is from the user's point of view (i.e. me casually inspecting the sources).
The log shows:
09:41:55.206 [io-compute-2] DEBUG c.s.kineticmerge.core.merge$ -- Conflict between left deletion of Section(
label = "BASE: 812043a7",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/CasesLimitStrategies.java,
start = TextPosition(line = 115, characterOffset = 8),
onePastEnd = TextPosition(line = 119, characterOffset = 0),
startTokenIndex = 593,
sizeInTokens = 3,
content = """};
}
"""
) and its edit on the right into Section(
label = "THEIRS: master",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/CasesLimitStrategies.java,
start = TextPosition(line = 117, characterOffset = 12),
onePastEnd = TextPosition(line = 123, characterOffset = 2),
startTokenIndex = 595,
sizeInTokens = 18,
content summary = """@Override
public boolean legacyMethod ... }
};
}
}
"""
).
09:41:55.207 [io-compute-2] DEBUG c.s.k.c.MatchesContext$MergeResultDetectingMotion$ -- Conflict at origin of move: resolved as a coincident deletion of Section(
label = "BASE: 812043a7",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/CasesLimitStrategies.java,
start = TextPosition(line = 115, characterOffset = 8),
onePastEnd = TextPosition(line = 119, characterOffset = 0),
startTokenIndex = 593,
sizeInTokens = 3,
content = """};
}
"""
); propagating right edit Section(
label = "THEIRS: master",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/CasesLimitStrategies.java,
start = TextPosition(line = 117, characterOffset = 12),
onePastEnd = TextPosition(line = 123, characterOffset = 2),
startTokenIndex = 595,
sizeInTokens = 18,
content summary = """@Override
public boolean legacyMethod ... }
};
}
}
"""
) to left move destination Section(
label = "OURS: condensedFilesBranch",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/pathPrefix2/CasesLimitStrategy.java,
start = TextPosition(line = 126, characterOffset = 8),
onePastEnd = TextPosition(line = 129, characterOffset = 4),
startTokenIndex = 689,
sizeInTokens = 3,
content = """};
}
"""
).
09:41:55.208 [io-compute-2] DEBUG c.s.kineticmerge.core.merge$ -- Coincident deletion of Section(
label = "BASE: 812043a7",
path = /private/var/folders/wh/xq3_z3j945q0j4___fxqb6xr0000gn/T/toyGitRepository4334175726671119423/pathPrefix1/CasesLimitStrategies.java,
start = TextPosition(line = 119, characterOffset = 0),
onePastEnd = TextPosition(line = 119, characterOffset = 2),
startTokenIndex = 596,
sizeInTokens = 1,
content = """}
"""
).
The code motion is actually via an edit - so
"""};
}
"""
is edited on the right into
"""@Override
public boolean legacyMethod ... }
};
}
}
"""
Observe the additional trailing closing brace.
Following that, the merge sees a coincident deletion of
"""}
"""
That's our closing brace again.
What should have happened in a perfect world would be for the closing brace to be matched on the base and right, in which case it wouldn't be part of the edit text and thus would not be propagated - in fact it would be considered a left deletion.
Here's a rather hokey way of fixing this:
startOffset
, onePastEndOffset
or size
though - fortunately there is no explicit connection between the size of a section and its actual content in Section
. We don't worry about breaking invariants either because it is the sections belonging to a File
that are expected to be contiguous, having the size computed from the content - the adjusted section simply makes it may through change propagation. The resulting merge isn't bothered about contiguity of sections at all - it expects them to be a mix from the left and right sides, and for code motion to happen.In fact the merge algebra in MatchesContext
can simply track whether a coincident deletion precedes or follows a deletion versus edit conflict, there should be no need to modify the core merge algorithm.
As of 99fc253f7561a0f585c739f6940cb3f3f06fbad4, there is a rather hacked solution that results in a correct merge for the suffix situation. This is also tested at unit level in MatchesContextTest
.
The next thing to do is to add tests for the prefix situation and to clean up the code...
Adding CodeMotionAnalysisExtensionTest.codeMotionWithEditAdjustment
to test this approach reveals a test failure in eeb74ffac67e7bcaa1ebc1b0e9cee6e0458836a3 that was swept under the rug by 55b3654936911cbbc1937be144120460d065b2fb.
The problematic merge inputs are:
BASE:
FishChipsMushyPeasKetchupMuesliToastTeaKippers}NoodlesSandwichCakePudding
LEFT:
MuesliToastTeaKippersFishChipsMushyPeasKetchupNoodlesSandwichCakePudding
RIGHT:
FishChipsMushyPeasKetchupPorridge}CakePudding
EXPECTED:
PorridgeFishChipsMushyPeasKetchupCakePudding
Each food article is a token, as is }
.
The merge log:
10:48:58.005 [main] DEBUG c.s.kineticmerge.core.merge$ -- Left insertion of Section(
label = "left",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 0),
onePastEnd = TextPosition(line = 1, characterOffset = 21),
startTokenIndex = 0,
sizeInTokens = 4,
content = "MuesliToastTeaKippers"
).
10:48:58.010 [main] DEBUG c.s.kineticmerge.core.merge$ -- Preservation of Section(
label = "left",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 21),
onePastEnd = TextPosition(line = 1, characterOffset = 46),
startTokenIndex = 4,
sizeInTokens = 4,
content = "FishChipsMushyPeasKetchup"
) as it is common to all three sides.
10:48:58.012 [main] DEBUG c.s.kineticmerge.core.merge$ -- Coincident deletion of Section(
label = "base",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 25),
onePastEnd = TextPosition(line = 1, characterOffset = 46),
startTokenIndex = 4,
sizeInTokens = 4,
content = "MuesliToastTeaKippers"
) with following right edit.
10:48:58.013 [main] DEBUG c.s.k.c.MatchesContext$MergeResultDetectingMotion$ -- Coincident deletion at origin of move: propagating deletion to left move destination Section(
label = "left",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 0),
onePastEnd = TextPosition(line = 1, characterOffset = 21),
startTokenIndex = 0,
sizeInTokens = 4,
content = "MuesliToastTeaKippers"
).
10:48:58.015 [main] DEBUG c.s.kineticmerge.core.merge$ -- Coincident deletion of Section(
label = "base",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 46),
onePastEnd = TextPosition(line = 1, characterOffset = 47),
startTokenIndex = 8,
sizeInTokens = 1,
content = "}"
) with following right edit.
10:48:58.017 [main] DEBUG c.s.kineticmerge.core.merge$ -- Right edit of Section(
label = "base",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 47),
onePastEnd = TextPosition(line = 1, characterOffset = 62),
startTokenIndex = 9,
sizeInTokens = 2,
content = "NoodlesSandwich"
) into Section(
label = "right",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 25),
onePastEnd = TextPosition(line = 1, characterOffset = 34),
startTokenIndex = 4,
sizeInTokens = 2,
content = "Porridge}"
).
10:48:58.020 [main] DEBUG c.s.kineticmerge.core.merge$ -- Preservation of Section(
label = "left",
path = "*** STUNT DOUBLE ***",
start = TextPosition(line = 1, characterOffset = 61),
onePastEnd = TextPosition(line = 1, characterOffset = 72),
startTokenIndex = 10,
sizeInTokens = 2,
content = "CakePudding"
) as it is common to all three sides.
The twist here is the competition for the right-edit between what could be a delete versus right-edit conflict of MuesliToastTeaKippers
into Porridge}
and a straight right-edit of NoodlesSandwich
into Porridge}
.
The merge algorithm favours straight edits over conflicts, so MuesliToastTeaKippers
is considered to have been coincidentally deleted, followed by a coincident deletion of }
, then the right-edit of NoodlesSandwich
into Porridge}
. The coincident deletion of MuesliToastTeaKippers
is resolved as propagated of a deletion to the move destination, so the final merge looks like:
FishChipsMushyPeasKetchupPorridge}CakePudding
.
So the claim of the right-edit stopped }
from being recognised as a suffix of the edit.
Even if that }
would have been connected with the second coincident deletion, adjusting the right-edit, the right edit is in the wrong place.
Intuitively, we would expect a match on the }
- with this match, we would have a left-deletion of the }
which puts Porridge
in front of the }
, so it has be interpreted as part of a delete versus right-edit conflict, and would thus be propagated. That would then apply a right-deletion to NoodlesSandwich
, leading to the expected balanced diet.
Given that the existing code is pretty hokey, I haven't yet written the logic to handle the suffix situation, and the problem with following straight edits, I'm inclined to deem this experiment a failure.
Suppose unmatched text is just exploded into one-token sections...
Trying that out in f7fb48465518ab05eb68924c0f7e669be997db07 breaks several tests - the existing edit propagation and insertion migration logic expects there to be just one section to shift to its new home, so this won't work without cutting over.
However ... CodeMotionAnalysisExtensionTest.issue42BugReproduction
passes!
Let's plod on...
As of 02d08597bcfe6c949afb1060caa4a0f7d54f24c1, now down to one test failure in CodeMotionAnalysisExtensionTest
- this one is caused by not having cut over the migration of insertions to cope with multiple inserted sections yet.
There are two outstanding failures in MainTest
, these may also involve edit propagation, not yet sure.
Let's sort out insertion migration and re-check...
Musing ... splitting up the gaps between matched sections affords the merge algorithm maximum flexibility in finding last-minute matches between tokens, but it can generate a lot of sections. That might cause the merge algorithm to fall over; so far this hasn't been observed, it seems that one span of unmatched text is always much smaller than the other, so the merge algorithm doesn't tank, although possibly it could be optimised for this.
What could be done would be to look for prefix and suffix matches upfront in MappedContentSources.filesByPathUtilising
, breaking up the gap-filler sections so that shared prefixes and suffixes are hived off into their own matched sections. That may well generate several ambiguous matches, so there might be merit in disabling the code-motion machinery for such matches - all they do is allow the merge algorithm to align prefixes and / or suffixes, and by being matches in their own right, this would allow the existing code-motion framework to carry on propagating / migrating individual edit or insertion sections.
A variation on this is to not consider the prefix / suffix sections as being matched; this still allows the merge machinery to align them. As the rest of the gap-filler section just makes a single token, we can again stick with the existing code-motion framework to do its thing.
The question is, how does this work with migrating insertions and anchoring? We could end up splitting a gap-filler section with a single anchor into several pieces, with more then one piece viable for migration. The problem is, we've lost the connection to the anchor. This can happen even when there is a preceding and succeeding anchor, if the gap-filler is split into 5 sections or more.
Drat.
Current thoughts:
a) Splitting up gap-fills into lots of one-token sections provides maximal opportunity to align tokens via the merge algorithm - and that is crucial to this ticket. b) Doing this before the merge allows the merge to make more sensible decisions. c) So far - but this needs to be revisted with manual testing - feeding the merge with lots of one-token sections isn't killing performance. d) It looks like the treatment of edits is more or less in the bag, might be few final snags to sort out, but hopefully nothing major. e) The only sticking point is how to cope with runs of several insertions when migrating code insertions. Perhaps the way to go is to flip the way the migration logic works - instead of starting with an insertion and finding its anchors (thus assuming that insertions that need migration are isolated from each other and thus part of distinct migrations), start with the anchors, go find any abutting insertions and link them in turn with other abutting insertions to build up a map of anchors to runs of contiguous insertions that need migrating. Or just work on all the insertions associated with a path in one go, finding the anchors and runs at the same time...
NOTE: CodeMotionAnalysisExtensionTest.merge
is subtly broken - visual inspection of the merge shows the desirable output of:
packageExecutable := {
val packagingVersion = (ThisBuild / version).value
println(s"Packaging executable with version: $packagingVersion")
is clobbered:
packageExecutable := {
val packagingVersion println(s"Packaging executable with version: $packagingVersion")
Need to come back to this, commit: f7fb48465518ab05eb68924c0f7e669be997db07 broke this.
Commit: c3b5c35a651398c251a767913304faff2e6f032a brings in support for migrating runs of contiguous insertions to the same move destination. All the tests in CodeMotionAnalysisExtensionTest
now pass, although there is still that subtle breakage seen by manual inspection of the merge output mentioned in the comment above to worry about...
MainTest
is down to a single failure in MainTest.twoFilesSwappingAroundWithModificationsToBoth
. The test MainTest.anEditAndADeletionPropagatingThroughAFileCondensation
now passes, that was the one that required insertion propagation, so that's good.
Hopefully the outstanding failure in MainTest
is the same as the one in CodeMotionAnalysisExtensionTest.merge
...
Also need to tidy up what I did in commit 864fc64ee444fae89a7132c1d4b2b53d142fc0c3 and merge the tidyup in...
Reviewing the logs of the MainTest.twoFilesSwappingAroundWithModificationsToBoth
revealed that there was a latent bug in MatchesContext
- when a conflict is was resolved in favour of a straight insertion with a propagated change, the insertion was not examined as a potential move destination.
Work done in d3569853e97168127a1b834070718bcbd0606f57 and fe944ce84294af5d4d8182e3ed265918793ffff2 addresses that, and merging it in via b0ff57abf492bbff6adf0e5d2c047c360672b42a fixes all the failures.
Great!
Still got the subtle merge breakage in CodeMotionAnalysisExtensionTest.merge
, though. One crumb of comfort is that it only manifests when the minimum match size goes below 6 - otherwise the merge looks good.
Think I'll take it as it is for now, so on to the tidying up...
Tidying up done, merged to main
in f3b353220f1fad04064b4dc99b2d16210e72b71e.
Don't forget to amend MatchesContextTest
to assert on the insertions detected during a merge...
Oh drat - doing the manual test discussed here: https://github.com/sageserpent-open/kineticMerge/issues/30#issuecomment-2106144845 causes the merge to overflow the stack.
Now, does this mean the merge should be rewritten to use continuations / trampolining, or turned into a proper dynamic programming algorithm, working bottom-up, or should the gap filling process be tuned to make fewer sections?
The prefix / suffix matching idea might be useful here...
Got a clumsy workaround for the stack overflow in 6abf07fa5293d2f21e524c1ad350904b7e653b3e, this will do as a proof-of-concept for now.
Looking at the results of the manual merge from #30 etc makes me wonder whether it is better to be more cautious about splitting up the gaps; again the prefix/suffix matching comes to mind.
Probably time for another ticket for that one - have raised #43.
Noticed that MatchesContext
wasn't recording all left- and right-insertions, again those insertions caused by resolution of an edit versus delete or edit versus edit conflict were only partially updating the MergeResultDetectingMotion
. Tests were written, and made to pass in: aedc282e5d802f3082dd2b605694bfc81378ce6f, which main
now references.
This went out in release 1.2.0, Git commit SHA: 4d1935beef2b5923417904895884de8c2e749c6e.
Git commit SHA: d5df1eb361b6c9a1e90fbe85d41142780e17274a bodges the test
MainTest.anEditAndADeletionPropagatingThroughAFileCondensation
so as to allow a closing brace in some Java code to be matched across the base and right sides of the merge.If this isn't done, the closing brace, being a single character, is not matched - so on the right side that adds a method definition in front of it, the method definition and the closing brace become a single inserted section. The closing brace in the base is thrown away by the merge, and the inserted section is migrated to its new home - but with the extraneous closing brace.
That results in a merge that doesn't compile and looks weird.
Need to revert d5df1eb361b6c9a1e90fbe85d41142780e17274a and find a way of making the test pass as is.