Closed sageserpent-open closed 8 months ago
Exhibit C - the merge:
Note that the referenced identifier wasn't renamed. What happened to it?
Exhibit D - the left form of the partially merged file has an orphan:
NOTE: the left and right panes of the merge triptych are the partially merged sides of the conflict, not the versions of the file as per the left and right changes.
Using git merge
yields the expected merge:
This is a diff of the file on the left branch before and after the diff, so the referenced identifier change is already on both sides.
Analysis: looking at the sections and matches in CodeMotionAnalysis
(btw, should these be logged separately from the mandatory console output via SLF4J?), what happens is that both changes are of a single token, but the one on the left comes later in the file than the one on the right.
This leads to an ambiguity - CodeMotionAnalysis
can pick a slightly longer base + left pairwise match to start with, then the change on the left, then a slightly shorter base + left pairwise match, or it can pick a slightly shorter base + right pairwise match to start with, then the change on the right, then a slightly longer base + right pairwise match.
What it won't do is try to break down the matches further into all-side matches with two separate edits, as the all-sides matches are shorter than and enclosed within the pairwise matches.
The pairwise matches overlap, however CodeMotionAnalysis
sensibly picks one alternative and goes with that - it is the two base and left pairwise matches, with a trivial section of just whenever
between them in the base and whatever
between them on the left. These suppress all attempts at breaking down the right hand side, so that it just one big section...
... so, we now have a right edit of the base up to the renaming of the referenced identifier into a single section that covers the entire right file input, followed by a left edit / right deletion conflict - one way, whenever
is edit into whatever
, the other way, it is outright deleted. Lastly, there is a right deletion of the remainder of the base.
My feeling is that merge
is doing the right thing, but with very unsuitable inputs furnished by CodeMotionAnalysis
. Intuition tells me that the breakdown into matches should have been much finer, going for three lots of all-side matches with the change from the left branch as a base + right match leading to a single-token left edit and the change from the right branch as a base + left match leading to a single token right edit.
What stopped this was the (expected) logic in CodeMotionAnalysis
that prefers a longer pairwise match to a shorter all-sides match subsumed within it. That makes perfect sense when there is a long pairwise match in isolation that subsumes an all-sides match, eg:
BASE: zzzzSAFEghghghghghghghghghgh
LEFT: zzzzSAFEghghghghghghghghghgh
RIGHT: zzzzINTRUDERghghghghghghghghghgh
NOTE: stuff in upper case denotes a single token - eg. SAFE
is one token, whereas stuff in lower case denotes sequences of tokens.
It makes no difference whether we regard this as base + left versus a single section for the right, or break it down into two lots of all-sides matches with an edit sandwiched between them. The fundamental change is in just one place, on the right.
Suppose there are the two changes that lie on either side:
BASE: zzzzSAFEghghghghghFIZZYhahahaha
LEFT: zzzzSAFEghghghghghBANGhahahaha
RIGHT: zzzzINTRUDERghghghghghFIZZYhahahaha
We can choose zzzzSAFEghghghghgh
as base + left, plus FIZZYhahahaha
as base + right, with edits to zzzzINTRUDERghghghghgh
on the right and BANGhahahaha
on the left. That makes a clean merge of zzzzINTRUDERghghghghghBANGhahahaha
.
Why didn't this happen?
There is some leading and trailing context prior to changes mentioned above in the Java merge example:
BASE: oooSAFEzzzzSAFEghghghghghFIZZYhahahahaSAFEyyyyyy
LEFT: oooSAFEzzzzSAFEghghghghghBANGhahahahaSAFEyyyyyy
RIGHT: oooINTRUDERzzzzINTRUDERghghghghghFIZZYhahahahaINTRUDERyyyyyy
This is because the right file had a global search-and-replace applied to it, so there are several edits lying on either side of the change made on the left. How does this affect things?
We get oooSAFEzzzzSAFEghghghghgh
as base + left, FIZZY
as base + right, then hahahahaSAFEyyyyyy
as base + left. So in theory that would also lead to a nice merge of oooINTRUDERzzzzINTRUDERghghghghghBANGhahahahaINTRUDERyyyyyy
.
This didn't happen - it looks as if FIZZY
was treated as being different between the base and the right …
Um, the clue is in the match threshold - which was 20%, leading to:
minimumFileSizeAcrossAllFilesOverAllSides = 5076
minimumWindowSizeAcrossAllFilesOverAllSides = 1015
maximumFileSizeAcrossAllFilesOverAllSides = 5076
minimumSureFireWindowSizeAcrossAllFilesOverAllSides = 1015
It's hardly surprising that the putative one-token base + right match for token FIZZY didn't take place!
Even worse, being ham-fisted and setting the match threshold to zero to match everything results in overlapping sections. Section overlaps are suppressed if a section is deemed to be in the small-fry below the minimum sure file size, but that won't happen in this case as the sure-fire size is the threshold size for each file being merged.
Let's do a thought experiment to see what would have happened if smaller all-sides matches had got in on the act...
BASE: oooSAFEzzzzSAFEghghghghghFIZZYhahahahaSAFEyyyyyy
LEFT: oooSAFEzzzzSAFEghghghghghBANGhahahahaSAFEyyyyyy
RIGHT: oooINTRUDERzzzzINTRUDERghghghghghFIZZYhahahahaINTRUDERyyyyyy
We'll assume the match threshold allows all the all-sides matches, but bars single-token matches for SAFE
, INTRUDER
, FIZZY
and BANG
.
That would lead to an all-sides match of ooo
, a conflicting edit (really!), an all-sides match of zzzz
another conflicting edit (equally dubious), an all-sides match of ghghghghgh
, another conflicting edit, an all-sides match of hahahaha
, yet another conflicting edit and final all-sides match of yyyyyy
.
Now if weren't for the obviously daft conflicting edits - they are really two right-edits, a left-edit and a final right-edit, then all would be well.
A tentative plan:
MatchCalculationState
when scoring a candidate window size.MatchCalculationState
with the optimal match sizes has been computed, there is some post-processing that does allow each smaller all-sides match to eat into its enclosing pairwise matches. How this works with ambiguous or overlapping matches is anyone's guess.merge
should be generalised to fall back to doing a straight comparison between two sections, even if they are not considered to match. The idea is to hope that the big matching sections can align the putative merge enough that one-token matches that would fall below the match threshold will fall out of the hat when merging.Let's think about what happens when a single large pairwise match encloses several smaller all-sides matches, some of which may be ambigious matches, possibly even with each other. Bear in mind that the section location on the side of such an all-sides match that isn't enclosed (there must be one, obviously) may have no obvious connection to the pairwise match.
Examples:
BASE: xxxxyyyyxxxx
LEFT: xxxxyyyyxxxx
RIGHT: xxxxioeruuwoexxxx
So we have a larger pairwise match of xxxxyyyyxxxx
between the base and left, and various ambiguous matches involving xxxx
. It is not really clear that either of the two xxxx
sections on the right have anything to do with the pairwise section and whether they should be thought of as being aligned. All we can hope for is to say they don't overlap, and thus the merge algorithm may decide to interpret these as preservations, doing a right-edit of yyyy
into ioeruuwoe
. That would give a merge of xxxxioeruuwoexxxx
, assuming that yyyy
is treated as a smaller base and left pairwise match - or that the final merge treats each y
is being equal tokens.
That seems clear enough, but how do we let the ambiguous all-sides matches eat into the pairwise match?
This approach is obvious:
BASE: xxxxyyyyxxxx
| |
LEFT: xxxxyyyyxxxx
| \
RIGHT: xxxxioeruuwoexxxx
So we get a reduced pairwise match of yyyy
on the base and left.
What happens if when we look at matches that cross-over the central part?
Hmm, given that all the ambiguous all-sides matches for a given window size are found in one go, that means that if a pairwise match's sections are simply eaten into on each side from all of the ambiguous matches, then the only thing remaining must be what made the larger match pairwise and not all-sides - so it is still valid as a pairwise match or matches.
In this example, we could eat into the base and left sides of the pairwise match in any order and still end up with yyyy
on both sides of the match.
Where it gets tricky is something like this:
BASE: xxxxyyyyxxxxzzzz
LEFT: xxxxyyyyxxxxzzzz
RIGHT: xxxxioeruuwoexxxxeoierp
Eating into xxxxyyyyxxxxzzzz
should yield two pairwise matches for yyyy
and zzzz
, but how does this emerge if sections are being eaten from arbitrary ambiguous matches?
Two approaches to eating away from pairwise sections:
CodeMotionAnalysis
need to be modified to not produce pairwise matches. CodeMotionAnalysisTest
.There is a third approach that I'm not keen on - this is to remove pairwise matching in CodeMotionAnalysis
and put it all on the merge algorithm to infer left-edits, left-deletions, right-edit and right-deletions. What I don't like about this is that it will stop Kinetic Merge from recognising edited moves further down the line - but that's the whole point of this exercise in the long run!
Added a bug reproduction test in Git commit SHA 3c69719c63c72085ddee215ac8205817338c7d24.
Had to edit the repeated characters in the example above to real words to avoid overlapping sections - not sure why this was necessary, need to revisit this.
The overlapping sections problem was down to having two overlapping matches (a base + left and a base + right overlapping on the base sections) that have exactly the same number of tokens, thus the same size. CodeMotionAnalysis
does not suppress overlapping sections at the same size, indeed it expects them when a candidate match is a substring of a larger optimal match.
Perhaps this should be revisited later, but for now let's park it and concentrate on this ticket.
NOTE: remember the comment here: https://github.com/sageserpent-open/kineticMerge/blob/b325f61f8c17d98ec45a487cda3ebe9a526aceef/src/main/scala/com/sageserpent/kineticmerge/core/CodeMotionAnalysis.scala#L78.
Should an empty file be represented by the absence of any sections or a single section of length zero? The former is unambiguous, whereas the latter allows an arbitrary number of zero-width sections to be packed in - in fact, that can even be the case for non-empty files.
Let's add an invariant somewhere that forces a section to be non-empty.
As of Git commit SHA c487896f9aac4266b6a0f420d1ce5a951848dca9, there is something tantalisingly close to closing this ticket. The bug reproduction test is passing, but after a fair bit of overhauling of CodeMotionAnalysis
there are now legitimate section overlap exceptions being thrown by CodeMotionAnalysisExtensionTest
when it tries to merge the example SBT sources.
The nub of the problem is the overlap between:
lazy val javaVersion = "14"
ThisBuild / version := "0.1.0-SNAPSHOT"
ThisBuild / scalaVersion := "3.3.0"
ThisBuild / javacOptions ++= Seq("-source", javaVersion, "-target", javaVersion)
lazy val packageExecutable =
taskKey[String]("Package an executable with Coursier")
lazy val
and
lazy val root = (project in file("."))
.settings(
publishTo := sonatypePublishToBundle.value,
pomIncludeRepository := { _ => false },
sonatypeCredentialHost := "s01.oss.sonatype.org",
publishMavenStyle := true,
On the left hand side there is a stanza:
lazy val packageExecutable =
taskKey[String]("Package an executable with Coursier")
lazy val versionResource =
settingKey[File]("Location of generated version resource file.")
lazy val root = (project in file("."))
The first match intrudes partway into this stanza, then there is:
versionResource =
settingKey[File]("Location of generated version resource file.")
, followed by the start of the second match.
For the base and right files the two alternative matches overlap on the lazy val
part!
The story so far as of Git commit SHA e3728d861d11dcb371c5c73afc4bcc5ab528fcb5...
All-sides matches are now allowed to be subsumed by larger pairwise matches when searching for matches; once all matches have been found, the subsuming pairwise matches are eaten into by the subsumed all-sides matches to yield a mixture of adjacent pairwise and all-sides matches.
When looking for the largest common subsequence during three-way merging, sections are now checked for equality by virtue of being in the same match or else having the same content. This picks up the smaller sections that don't make the grade when looking for matching sections, but are actually the same across the sides. It is open to debate as to whether this can be dropped if matching is allowed to proceed right down to single elements (tokens) or should be kept as a way of justifying not looking for mono/di/tri-graph section matches (which cause a lot of expensive noise when matching).
Seeing some odd looking merge results in CodeMotionAnalysisExtensionTest.merging
when setting the threshold right down to below 0.006 (0.6 %). Presumably we are down to single-token matches here, perhaps that is the reason?
CodeMotionAnalysisTest.matchingSectionsAreFound
is once again creating test cases with incorrect expectations - once again, there are coincidental match opportunities it doesn't know about; these render the test expectations invalid. Grr.
Point number three could perhaps be dealt with by banning mono/di/tri-graph matches? We do have the fallback content comparison when looking for the longest common subsequence, after all...
So it seems like we're getting there. Slowly. Diligently. Painstakingly.
As of Git commit SHA 9da79b3728ba4b0d3e8f76ba9ef1b16d4c93a7ac, have corrected test case generation in CodeMotionAnalysisTest.matchingSectionsAreFound
, but have exposed a bug in CodeMotionAnalysis
where an assumption about the presence of sections in all sides from an implied all-sides match isn't always valid when eating into a pairwise match.
What can happen is that a pairwise match subsumes a section that is in theory ambiguously matched on all-sides, but only one of the ambiguous matches completely subsumes the pairwise section. If said ambiguous match turns out to be blocked due to the threshold being too high in its file, then the eating blithely chomps on the section on one side of the pairwise match but doesn't have the corresponding section on the other side (because its match was suppressed).
Need to either make the eating of pairwise sections more robust - preferably by working at match level rather than section level, or go back to the idea of detecting all-sides matches first for all window sizes, then doing a second pass for pairwise matches, using subsumed all-sides matches to block subsuming pairwise matches in addition to the usual checks for being subsumed or overlapped.
Food for thought ...
Git commit SHA f9395e1b8a7467f4777e64f08dfd38eb0f8898af implements a revised and robust way of eating into pairwise sections; along with some major rework of MatchCalculationState
and SectionsSeenAcrossSides
- these have now merged into a simplified MatchesAndTheirSections
, which holds the matches and their sections.
Multiple ambiguous matches may reference the same section on a given side; this is robust against ambiguous pairwise matches where one pairwise match is eaten into by an all-side match but the other isn't - the implementation will generate an overlap between the full section for the pairwise match that isn't eaten into and the smaller sections for the one that is.
The bug described in the previous comment has also been fixed, and CodeMotionAnalysisTest.matchingSectionsAreFound
is back to passing with just a few rejected test cases with overlapping match section. Great.
What remains is a couple of failures in MainTest
and CodeMotionAnalysisExtensionTest
that seem to be down to overlapping sections. These need reviewing - if they are legitimate responses due to ambiguous matches, then either:
There is also a long running trial in CodeMotionAnalysisTest.matchingSectionsAreFound
- case 56 takes 4 minutes 33 seconds or so to come up with a rejected trial due what looks like a failed suppression of a digraph section...
All of the tree trial rejections seen in CodeMotionAnalysisTest.matchingSectionsAreFound
are due to subsumed sections slipping through the net and being picked up as overlaps.
So are the failures in the other tests. So it's close to being in the bag...
Tests are passing in Git SHA 34ba3bc0884c2eb8bb23416c3e9e20d5047a74f9!
CodeMotionAnalysisExtensionTest
is showing some odd-looking merge results at very low match size thresholds.
This might be due to eed679232a983ff491ac01e72a3fa9a41b6ffd44, which banned monograph matches...
Spoke too soon - the very last trial in CodeMotionAnalysisTest.matchingSectionsAreFound
- the long running one - fails. Grrr.
Also tried reinstating monograph matches - this neither fixed the failing long running trial , nor did it make the odd merges nicer.
Finally, on this most auspicuous of days, we have Git commit SHA 4c8390f77cc73e0ce0b5959d0913b597cc1fd8f7.
Behold, all the tests pass!
Some subsequent housekeeping work done as of Git commit SHA 7505c922bb3d00de9c2913346b57c21af1c42966.
Now to revisit the original example....
Merging with match-threshold=0.005
yields this as a final diff:
The merge triptych again:
This time the test is renamed on the left and the identifier is renamed on the right.
Let's try again with the branches other way around, the same way as for the original example...
The diff:
The merge triptych:
So it looks good. Now, using something like --match-threshold=0.01
yields varying merges that are both poor, but don't feature orphaned test. However, it looks like the problem for this ticket is resolved. Those poor merges can be dealt with in their own ticket...
This went out in release 0.1.10, Git commit SHA 64fd5448f2f9e6d36953c86345d069d62d1d3a2f.
Exhibit A - a change made on the left branch:
Exhibit B - a change made on the right branch:
So the renaming of an identifier referenced within a test method in A versus the renaming of the test method itself in B.