Clean Merging (No code motion)

sageserpent-open commented 1 year ago

Given a Git repository of files in multiple directories that has been worked on in two branches, provide a command line utility that merges one branch into another.~~, taking into account code motion both within the files and across them.~~

~~Any divergent code motion is treated as an error and causes the merge to be aborted - the repository should be left as it was prior to the merge being attempted.~~

Any conflict is treated as an error and causes the merge to be aborted - the repository should be left as it was prior to the merge being attempted.

Nothing fancy is done with whitespace - if it causes a conflict, too bad.

Successful completion should commit a standard Git merge commit with two parent commits.

No user intervention is required - it either succeeds and commits, or fails and rolls back.

sageserpent-open commented 1 year ago

The actual merge should be done using git merge-file, supplied with temporary files for the relevant code.

sageserpent-open commented 1 year ago

To clarify the phrase from above: "the repository should be left as it was prior to the merge being attempted", the assumption here is that the working directory tree started off with no uncommitted changes (staged or not) prior to attempting the merge.

If that isn't the case, the merge isn't even started - having a clean working directory tree is a precondition for the merge.

So if an actual rollback is needed on the filesystem, it can be implemented by a simple checkout of the commit in force prior to attempting the merge.

sageserpent-open commented 10 months ago

Reducing the scope for this ticket, as it's taking a long time to implement. It would be good to get the big picture of the code organisation (notably, the application shell) delivered.

Just delivering a tool that does conventional merging is enough to get some serious implementation work out of the way:

~~Code Motion Analysis~~ DEFERRED - NO LONGER RELEVANT TO THIS STORY
~~Longest Common Subsquence~~ DONE
~~Merge~~ DONE
~~application shell~~ DONE

All of these are needed to support code motion, but can be delivered earlier in a tool that at least should provide an alternative to Git's merge.

It would be interesting to compare the results of this ticket with plain git merge to see if there is any (dis)advantage in using Kinetic Merge for basic merging.

I had considered using all the merge machinery purely to detect code motion and then let git merge-file do the final merge for the sake of consistency between clean merges and ones that need intervention for conflict resolution, as per #2. However there might be value in using the merge machinery as far as possible without involving git merge; if there are conflicts, the stage entries could be synthesized from the partial merges.

That was always the plan for code motion - git merge-file would be presented with synthetic left and right files that incorporated edits into code motion, so there is no harm in thinking about merging with synthetic left and right files.

Having an early drop of the tool with the merge machinery used to its full extent provides early feedback as to whether git merge-file is needed at all.

sageserpent-open commented 10 months ago

Got merging of sections working without tracking code motion. The next step was going to be to implement CodeMotionAnalysis, but by generalising the API of Merge to apply over arbitrary sequences instead of just sequences of Section we can apply it directly to the file contents without having to build sections.

Let's try that - a bit more work to generalise the API, but it avoids having to implement CodeMotionAnalysis in this ticket, and that will be a hard job.

sageserpent-open commented 9 months ago

Turned out that even doing a straightforward merge without sections / code motion was not so simple for real-world text.

I now know that LCS really doesn't scale well for character-level sequences, and whitespace has a nasty way of grabbing the attention of the LCS algorithm, providing a spurious contribution to the least common subsequence.

For example, take the three sentences:

BASE: "Hurricanes hardly ever happen in Hampshire", LEFT: "A bird in hand is worth two in the bush." RIGHT: "All's well that ends well."

Not only is performance dreadful when done at the character level, but the merge itself is terrible - do we really want to match on the 'i' in Hurricanes and bird? People see a sentence as a sequence of words, and source code as a sequence of keywords, identifiers and various delimiters, operators etc.

Going to a word-based approach, there is not much in common from the point of view of the words, which is what we're interested in - only in is shared between the base and left inputs - but the whitespace between the words makes a spurious LCS that breaks up what are essentially two conflict sections - the one before in and the one after.

Taking this on board:

The text is tokenised, treating it is a mix of core programming language regular expressions (identifier, string literal, numeric constant) and miscellaneous characters. This works for both sample Scala source code and English prose - words are treated as identifiers, and this is fine given that the merging has no idea of the semantics of the text or indeed of the tokens. Doing this dramatically cuts down on the size of the sequences that need merging, which really helps the LCS part.
The LCS is now run inside of a PartitionedThreeWayTransform that breaks down the three sides recursively into partititions for each input, then runs the LCS across corresponding partitions from the three sides. This in itself brings us from completely impractical LCS performance to something workable, and is the major optimisation for getting LCS over the line.
Whitespace is now folded into the preceding token in such a way that is doesn't affect the equivalence (or hashing) of a token - so the LCS is effectively computed purely from the non-whitespace tokens, but the whitespace is still preserved for both clean and conflicted merges.

sageserpent-open commented 9 months ago

The recent changes have been done as a spike in MergingTextTest. These need to be broken out and put into the non-test code folders.

As an aside, the current tokenisation implementation uses the Scala parser combinators library - the tokenisation is a bit clunky, but is by no means the performance bottleneck in a merge. If it ever gets to the point where tokenisation needs a performance boost, consider the RE2 library, trading the use of grammar productions to assemble 'super-tokens' out of more primitive tokens furnished by simple regular expressions for a more sophisticated use of regular expressions with context to build tokens directly.

Not sure if that approach is workable or even performant, but it is a possibility.

sageserpent-open commented 9 months ago

Also need to remove CodeMotionAnalysis in a nice revertible commit prior to shipping the first cut of Kinetic Merge, as it is just dead code right now.

sageserpent-open commented 9 months ago

Git cribsheet for use in application shell:

git merge-base <our branch> <their branch>. Outputs a best common ancestor commit between the two branches (may not be unique).
git diff --no-renames --name-status <best common ancestor commit> <our / their branch>. Outputs one of M, A or D, some whitespace and a path to the relevant modified / added / deleted file since the best common ancestor commit.
git ls-tree <commit id> <path>. Outputs the blob id for that path, amongst other things.
git cat-file blob <commit id>:<path>. Outputs the contents of the file at the given path as of the commit id, provided it exists.
git cat-file blob <blob id>. Outputs the contents of the file at the given path as of the commit id, provided it exists.
git update-index --add <path>. Adds the file at the path to the index.
git hash-object -t blob -w <path>. Creates a blob from a (temporary) file at the path, outputting an id for the blob.
git update-index --index-info. Allows blobs to be staged; use this for conflicts.
git merge-file -p --diff3 <destination> <base> <source>. Merges the contents of the source with destination, overwriting the destination. The merge is respect to the common base. The resulting merge goes to stdout. Use this to create a conflicted merge file distinct from the staged conflict entries.

sageserpent-open commented 9 months ago

Shell outline:

Parse a single command line argument that is the name of the source (their) branch (or just a commit SHA) to be merged into the current one. Assume that the current working directory is a valid Git directory, so there is a destination (our) branch, and that git status reports no changes either staged or waiting to be staged. If this assumption does not hold, bail out. Extraneous files neither tracked by Git, ignored or added to the stage are simply left where they are and do no participate in the merge.
Get a best ancestor commit between our and their branch.
Find out what has changed in both our and their branch wrt the best ancestor commit - record the path and whether the file at that path was modified, added or deleted in a change for the branch.
If the same path exists in both changes, then:

Modify / delete conflict leads to staging conflicted entries for the modified file and the deleted one.
Raise a severe error and abort if there is an add / delete conflict - this should not happen.
Raise a severe error and abort if there is an add / modify conflict - this should not happen.
Add / add leads to merging for that path.
Modify / modify leads to merging for that path.
Delete / delete leads to straight deletion, treat this as a no-op.

If a path exists in only one change, then treat this as a no-op if in our branch, or for modification / addition on their branch, stage the file blob according to their branch into the index, and for deletion and their branch, delete the file from the index.
All files under paths that require merging are fed to tokenisation and thence to merging, getting the file contents from the corresponding blob.
Those files that merged cleanly are added to the index, creating blobs for the merge results and updating the index from the blobs.
Those files that merged with conflicts are added to the index as stage 2 and 3 entries, creating blobs for each side of the merge. The file contents from the best ancestor commit are put into the index as a stage 1 entry. Ue git merge-file to write a partially resolved file based on the contents on the file from the best ancestor commit and the left and right merged files, so the merge is already halfway done prior to letting Git synthesize the conflicted merge.
Synchronise the working directory to the index so that git status would show only the index differences.
If all paths were either trivially merged from our branch, updated into the index from their branch or cleanly merged from both, then write a tree and then a commit with two parents. Otherwise set the MERGE_HEAD ref to their branch and report the merge as conflicted.
If there is an exception, roll back via git reset --hard. Merge conflicts are not considered to be an exceptional case.

sageserpent-open commented 9 months ago

Left unsaid so far is whether Kinetic Merge should complete the merge if no conflicts are detected. Probably best to go with with git merge does and allow a --no-commit flag to control whether the merge is left uncommitted for further review and editing. By default, the merge will be committed, and a further --commit flag can be used to override an existing --no-commit.

sageserpent-open commented 9 months ago

What do do about file permissions that have changed? Especially if they have changed on both branches?

Propagating a file permissions change made on just one branch is an obvious choice, but if there is a conflict, aborting the merge seems drastic.

Perhaps the staged conflict entries should be given the conflicting file permissions, and Git (or a third party merge) can sort it out. This means that if there are no conflcits in terms of content changes for a given file, then Kinetic Merge has to make fake conflicting entries that have the same content but different permission modes.

As for the file representing a the partial merge of the file, this is given whatever permissions git merge-file comes up with.

sageserpent-open commented 9 months ago

Git commit SHA: f2ea3c2f137fcfdc482b12a901900d721c3d6027 implements the above and has been tested manually on a toy repository.

In addition, if our branch contains their branch, this is treated as a successful no-operation. If their branch contains ours, ours is fast-forwarded to theirs as a successful operation.

If an unexpected error occurs, such as a Git command not working (contrast with when the merge operation fails in its own right, ie. conflicting edits or file addition versus file deletion conflicts), then the repository is rolled back to its initial state. The code for this is pretty hokey, but it does its job.

sageserpent-open commented 9 months ago

What's left to do?

~~Provide some feedback from the tool to the user other than just the exit code!~~ DONE
~~Add some automated tests to prove the application shell works properly.~~ DONE
~~Consider cutting over to something like Cats Effect in the shell code to get a nicer rollback.~~ DONE
~~Try out the packaged executable on some real code.~~ TRIED OUT ON A SINGLE SCALA EXAMPLE!

sageserpent-open commented 9 months ago

Time to release an executable and see if it runs on some random computer somewhere, downloading off Maven Central. I think the release ceremony diserves another ticket, as folks can build this locally for beta testing...

sageserpent-open commented 9 months ago

Evidence of a successful merge, including a file with parallel changes ....

Screenshot 2023-10-02 at 18 57 42

Git merge requires manual intervention:

Screenshot 2023-10-02 at 19 00 29

Commands executed (merge commit from Kinetic Merge was dropped from the branch via IntelliJ):

sageserpent-open / kineticMerge

Clean Merging (No code motion) #1