jelmer / dulwich

Pure-Python Git implementation
https://www.dulwich.io/
Other
2.04k stars 390 forks source link

missing 3-way merge #754

Closed kevinhendricks closed 4 years ago

kevinhendricks commented 4 years ago

I recently found the blog of James Coglin, "The If Works" See: https://blog.jcoglan.com/2017/05/08/merging-with-diff3/

Where he does a wonderful job both explaining how a diff3 style merge works and providing a basic implementation of it in ruby (I think but I have never programmed in ruby so I am unsure).

Using this blog as the basis, I created a diff3merge.py program I would be willing to contribute to dulwich under an MIT license that appears to work.

It does require a diff routine that will output all lines (not just modified lines for ease of calculating the line number offsets for chunks). I also have a python implementation of the myers_diff that does just this but it could easily be replaced by a difflib call to use ndiff while dropping the extra intraline edit hints.

I am sure it still needs work but does pass the simple test cases I have tried. It should provide a good basis for further modification and testing.

Any interest?

jelmer commented 4 years ago

This is a dupe of #452; I've been eying https://github.com/breezy-team/merge3 for use in Dulwich for a while, but have not managed to see if it can be relicensed under a license that would be compatible with Dulwich' license.

kevinhendricks commented 4 years ago

Which is why I am offering an MIT license version of the code I wrote to do a 3-way merge. Do you want it?

jelmer commented 4 years ago

On Sun, Apr 05, 2020 at 02:31:40PM -0700, Kevin Hendricks wrote:

Which is why I am offering an MIT license version of the code I wrote to do a 3-way merge. Do you want it?

I'd in particular be interested in a pull request that added merge support in Dulwich. The interesting bit of code (for Dulwich) is the code that determines which file versions to invoke the three way merge on.

That code then needs to use a 3-way merge function to do the actual merging per file; it would be great if Dulwich could bundle or depend on a 3-way merge implementation so that there is a reasonable default, but bundling a 3-way merge implementation without that integration is not very useful before then IMO.

-- Jelmer Vernooij jelmer@jelmer.uk PGP Key: https://www.jelmer.uk/D729A457.asc

kevinhendricks commented 4 years ago

So finding the ancestor at the earliest point each branch was first forked is the hard part. I would not have figured that. So closing this as well if it is not useful to you.

jelmer commented 4 years ago

On Mon, 2020-04-06 at 10:39 -0700, Kevin Hendricks wrote:

So finding the ancestor at the earliest point each branch was first forked is the hard part. I would not have figured that. So closing this as well if it is not useful to you. That's one part of it. What needs to happen is:

  1. Find merge base (the common ancestor)
  2. Determine the relevant file versions to merge
  3. Call out to a three way merge algorithm to do the merge, write conflicts, etc.

A pull request that added this would be awesome. Just a code dump of a three way merge algorithm without any of the rest isn't particularly useful - there are plenty of other 3-way merge algorithms that match those requirements out there, e.g. https://gist.github.com/stepchowfun/4713315

kevinhendricks commented 4 years ago

Yes, exchanged e-mails with Stephen and he said not to use his version in any production environment as it is "word based" and not "line based". His was the only MIT license one I could find and I did dig for one. That is why I offered mine.

Sigil will just grab the recursive merge code from git around version 1.3 or so and convert it from python 2 to 3, and make my 3-way merge code GPL (if it is still needed) and create our own internal-only fork of dulwich (under the GPL) for our needs.

Before having to do that, I really wanted to try and help move things forward for dulwich itself (at least to the point it could do a simple merge of two branches each from master at different points in time). Any octopus or prior cross merge stuff could come later. But I can clearly see that will never happen as we work in very different ways, build community in different ways, and view contributions themselves in very different ways.

Good Luck with dulwich. All the best. Kevin

jelmer commented 4 years ago

I don't think I've had a response like this before so that's caused me to reflect on it a bit...

My apologies for my lack of constructiveness on this particular issue. :-/ I really do appreciate the offer for help, and do welcome contributions.

I think my initial response to this issue was motivated in large part with our discussion about the diffstat PR, which to me felt like a code dump - i.e. "here's some code, take it or leave it". The way the first message in this issue was phrased made me worry we'd go through the same process again, and I don't think that helps either of us. But I should have called that out as a worry explicitly rather than being negative about the proposal overall.

So let me back up and rephrase my initial response. Yes, I would be interested in a merge3 implementation. It doesn't have to be perfect; it should meet the minimum bar for inclusion in dulwich documented in CONTRIBUTING, but I'm happy to help get it into that shape.

kevinhendricks commented 4 years ago

Okay ... In the end I would much rather contribute code to dulwich than have to maintain my own fork.

After the holiday, I will run some preliminary tests on it to make sure it is working and then post it in Sigil's dulwich repo so you can take a peek at it before you decide if you even want it or not.

jelmer commented 4 years ago

I've added a bit of skeleton code in the "merge-1" branch. This in particular implements the tree handling; it's still very much lacking:

kevinhendricks commented 4 years ago

Thanks. I will check it out.
I have posted the 3-way merge code here:

https://github.com/Sigil-Ebook/dulwich/blob/master/dulwich/diff3merge.py

Fair warning: it is still in very early form. It has not had flake8 run on it yet, nor does it have docstrings yet as it is really just meant for testing with at the moment. If you do end up wanting it, I will clean it up and fix any issues that way.

I had to modify it to use the official python difflib module via ndiff. Prior to this change, I have done all testing using a myers diff routine that I have in python3 but a license file was never provided to that code, so I have not added it here to be very safe license-wise.

It is amazing to me how even slight changes in the diff engine used can change the output. No wonder git has multiple diff implementations.

I will grab your merge branch and start to play around with it.

I am still getting up to speed but I have been looking at trying to find the ancestor for a specific file (target path) in two branches: I assume at its simplest it would look something like ...

  1. from ref/head get the most recent commit for branch
  2. follow commit to get tree it points to
  3. walk tree to get blob for the path of interest, get its sha1
  4. use parent of this commit next
  5. repeat steps 2 through 4 looking for changes in the path of interest's sha1
  6. when it changes we know that file was touched by that commit and record this in a commit list for that file and branch

After you have built a specific file's commit list for one branch, repeat it for the same file in the other branch.

Stop as soon as a commit is added that matches a commit in the first branch's commit list. This would be the ancestor file

Doing it for a specific file one at a time would not be efficient, but this approach could be modified to build a dictionary of commit lists keyed by file paths.

Of course renames of directories and files makes this whole thing a lot harder but I just wanted to start with a simple case first.

I am going to try this approach by hand on a test repo with two branches to see if I can at least see what needs to be done.

Hope this helps.

kevinhendricks commented 4 years ago

Grabbed your merge-1 code. Noticed the following in call to file_merger:

merged_text = file_merger(
    object_store[this_entry.sha].as_raw_string(),
    object_store[other_entry.sha].as_raw_string(),
    object_store[other_entry.sha].as_raw_string())

I think you want one of these to be the base_entry but I am unsure which order you prefer.

I used (ancestor, alice, bob) but can use any order or arguments you want. So do you want (this, base, other) or (this, other, base).

I will then modify the diff3merge.py code to accept byte strings and not paths to files and integrate it as a possible method for file_merger call.

kevinhendricks commented 4 years ago

Okay I have updated diff3merge.py to work on bytestrings only, and added an interface to use for file_merger for your merge.py code in merge-1.

See https://github.com/Sigil-Ebook/dulwich/blob/master/dulwich/diff3merge.py

I really would like to write a dedicated bytestring diff generator using the myers_diff approach that fits your license. I will look into doing that.

All of this is assuming that you wanted merge.py to use the following (ie. base last) as the interface call:

merged_text = file_merger(
    object_store[this_entry.sha].as_raw_string(),
    object_store[other_entry.sha].as_raw_string(),
    object_store[base_entry.sha].as_raw_string())

We should be able to now use dulwich/git to test the 3-way merge code (at least for the differ used by difflib).

Also how do you want to handle getting conflict info? My Merge3Way class has a get_conflicts() method that will return a list of conflicts that provide the conflict line number ranges (tuples) for the base, this, and other lines. But instead of a single call for file_merger (ie. my do_file_merge) you would need to instead pass along an instance of a file merger class that would keep state and allow your end to collect any conflict info it needed.

Either way, this should hopefully get things started to see if the diff3merge.py does what you want or not.

jelmer commented 4 years ago

Very quick reply, I'll try to reply in more detail tomorrow and put some more time into the code:

Thanks for posting the merge3 code and updating to use bytestrings.

I think you want one of these to be the base_entry but I am unsure which order you prefer. Yes, oops - thanks. I've changed the third argument to base_entry, for consistency with the top-level function.

Regarding conflicts, perhaps it would make sense for the merger to return a 2-tuple with the merged text (with conflict markers inserted) and the list of conflicts? We can then return the list of conflicts to the caller of merge(). The only thing that we really need to store in the index is whether a file was conflicted or not, so we can determine that by checking if the list of conflicts was empty or not.

kevinhendricks commented 4 years ago

I will push a new version of diff3merge.py that returns tuple. I have written a Myers diff implementation for MIT license that I will also push.

A few things to fix ...

Just now have it now doing simple merge of branches to add merged objects to the store but no resulting tree or commits. But can now use git catfile -p to verify merge is correct. It seems to be working for simple cases.

Take care,

Kevin

kevinhendricks commented 4 years ago

Okay, I just pushed a few things to Sigil-Ebook/dulwich to master:

modified:   dulwich/diff3merge.py 
          - updated to add two file_merger options and to return a tuple with conflicts

modified:   dulwich/porcelain.py
         - bug fixes for merge_base, and added a branch_merge command

modified:   dulwich/tests/test_porcelain.py
         - direct copy of your merge-1 branch version

added dulwich/merge.py
         - some fixes and some debug print added

added dulwich/myersdiff.py
        - added an implementation of myers diff algorithm for diff3merge to use

added test1.tar.gz
        - tar gzipped version of git test1 repo with just a single file "script.txt" and two
          branches "br1" and "br2" each with modified versions of script.txt that should
          merge without conflicts

add testmerge.py
        - a simple test run on the test1 repo

If I unpack test1.tar.gz right beside testmerge.py, I can run testmerge.py right in the tree.

It should give the following output:

LH4057-3970:huh kbhend$ python3 ./testmerge.py {b'br1', b'master', b'br2'} b'2ce523cc8f6afcbc802c78b2cdd1181c521ff8b1' reached merge_trees in change modify hello b'script.txt' TreeEntry(path=b'script.txt', mode=33261, sha=b'e9bbb190205651a73bb5e403e43dc65b807394a5') TreeEntry(path=b'script.txt', mode=33261, sha=b'24ad8766c2cce8f9dac864d14112d341b99026fb') TreeEntry(path=b'script.txt', mode=33261, sha=b'104466e496908b3ee1033d6976fa2aea8250fa12') in _merge_entry TreeEntry(path=b'script.txt', mode=33261, sha=b'83e8a7f2975c19d5812054d664b1f7207ccbb123')

The new TreeEntry represents the merger script.txt file.

I then use git cat-file -p to test its output

Here is the simple testcase:

ancestor (master)

This is a more complete test and a few typ0s to fix also I plan to add few lines and to remove other lines

alice (br1)

Add a line here This is a more complete test and a few typ0s to fix also I plan to add few lines and to remove

bob (br2)

This is a more complete test and a few typos to fix also I plan to add few lines and to remove other lines

Merged Result is:

LH4057-3970:test1 kbhend$ git cat-file -p 83e8a7f2975c19d5812054d664b1f7207ccbb123 Add a line here This is a more complete test and a few typos to fix also I plan to add few lines and to remove

Hope this helps.

kevinhendricks commented 4 years ago

Okay added the following simple_merge_base to porcelain.py to enable testing without any git. It uses a walker to walk each branch to build up a commit history for each branch and returns the most recent common commit it any.

def simple_merge_base(repo, committishs):
    from .walk import ORDER_DATE
    with open_repo_closing(repo) as r:
        commits = [parse_commit(r, committish).id
                   for committish in committishs]
        # build up list of commit histories from most recent to oldest
        # for each branch
        commit_histories = []
        for commit in commits:
            walker = r.get_walker(max_entries=None, include=[commit], order=ORDER_DATE)
            history = []
            for entry in walker:
                history.append(entry.commit.id)
            commit_histories.append(history[:])
    # walk first history looking for first matching commit in
    # that exists in remaining histories
    first_history = commit_histories[0]
    commit_histories = commit_histories[1:]
    for cmt in first_history:
        found = True
        for history in commit_histories:
            found = found and cmt in history
        if found:
            print(cmt)
            break
kevinhendricks commented 4 years ago

Two other little nits you might want to fix as well in merge-1 as of today:

  1. In porcelain.py line 1636 you need to pass r here not repo to make it work as find_merge_base expects a repo object not a path here:

change:

    return find_merge_base(repo, commits)

to:

    return find_merge_base(r, commits)
  1. in merge.py in line 202, you probably want to pass "file_merger" into merge_tree here and not let it default to None.

change:

    for entry in merge_tree(
            repo.object_store,
            repo.object_store[this_commit].tree,
            repo.object_store[other_commit].tree,
            repo.object_store[merge_base].tree,
            rename_detector=rename_detector):

to:

    for entry in merge_tree(
            repo.object_store,
            repo.object_store[this_commit].tree,
            repo.object_store[other_commit].tree,
            repo.object_store[merge_base].tree,
            rename_detector=rename_detector,
            file_merger = file_merger):
kevinhendricks commented 4 years ago

One question ... should we be merging tree entry blobs into tree entry blobs to create new tree entries/blobs at all? Shouldn't we just performing the blob merge and writing out the resulting merge file to the working directory instead of to a blob.

And then only automatically creating and doing a commit from the working directory if no conflict are detected?

That way if there are conflicts the files will not have been committed (no blobs added) but all would only exist in the working directory for the user to either abort or manually fix the working directory files with conflicts. And then manually perform the commit after any fixes.

So the process might look like the following:

  1. checkout "this" branch into working directory
  2. Walk the "other" branch head commit tree and try to align a tree entry blob with an existing working directory file.
  3. perform any 3 way merges required but into the working directory files only (recording if conflicts existed)
  4. If no conflicts, auto stage and commit current working directory into "this" branch.
  5. Otherwise provide list of files with conflicts and allow user to decide what to fix and when to commit.

Isn't that closer to what git really does or am I way off here?

jelmer commented 4 years ago

On Fri, Apr 17, 2020 at 06:57:29AM -0700, Kevin Hendricks wrote:

Two other little nits you might want to fix as well in merge-1 as of today:

  1. In porcelain.py line 1636 you need to pass r here not repo to make it work as find_merge_base expects a repo object not a path here:

change:

    return find_merge_base(repo, commits)

to:

    return find_merge_base(r, commits)
  1. in merger.py in line 202, you probably want to pass "file_merger" into merge_tree here and not let it default to None.

change:

    for entry in merge_tree(
            repo.object_store,
            repo.object_store[this_commit].tree,
            repo.object_store[other_commit].tree,
            repo.object_store[merge_base].tree,
            rename_detector=rename_detector):

to:

    for entry in merge_tree(
            repo.object_store,
            repo.object_store[this_commit].tree,
            repo.object_store[other_commit].tree,
            repo.object_store[merge_base].tree,
            rename_detector=rename_detector,
            file_merger = file_merger):

Thanks, applied.

-- Jelmer Vernooij jelmer@jelmer.uk PGP Key: https://www.jelmer.uk/D729A457.asc

jelmer commented 4 years ago

On Wed, Apr 22, 2020 at 07:23:36AM -0700, Kevin Hendricks wrote:

One question ... should we be merging tree entry blobs into tree entry blobs to create new tree entries/blobs at all? Shouldn't we just performing the blob merge and writing out the resulting merge file to the working directory instead of to a blob. My goal was to have merge just produce a list of changes that need to be applied to THIS. A separate function would then apply those changes to the index (or to something else).

I'd like to keep the merge logic isolated from the index on disk. That way, we have the ability to do merges purely in memory when we want to. It also makes testing a lot easier, since we do not have to have to create full repositories.

dulwich.porcelain.merge() would provide the higher level command that does other things like possibly fast-forward and that optionally commits after the operation.

And then only automatically creating and doing a commit from the working directory if no conflict are detected?

That way if there are conflicts the files will not have been committed (no blobs added) but all would only exist in the working directory for the user to either abort or manually fix the working directory files with conflicts. And then manually perform the commit after any fixes.

So the process might look like the following:

  1. checkout "this" branch into working directory
  2. Walk the "other" branch head commit tree and try to align a tree entry blob with an existing working directory file.
  3. perform any 3 way merges required but into the working directory files only (recording if conflicts existed)
  4. If no conflicts, auto stage and commit current working directory into "this" branch.
  5. Otherwise provide list of files with conflicts and allow user to decide what to fix and when to commit.

Isn't that closer to what git really does or am I way off here?

As far as I can tell, C Git also layers things. There are are lower level operations like "git merge-tree" that work similarly to dulwich.merge.merge_tree, for example.

Jelmer

kevinhendricks commented 4 years ago

Understood. I will play around trying to build a more sophisticated acyclic graph walking approach to detect the best (most recent date) ancestor commit that hopefully will works even when a commit has multiple parents (from previous merges?).

My guess is we can use a recursively built list approach to build a set of all possible parents for each branch and create a set of ancestor commits to choose among. Very similar to building and walking an HTML node tree except with multiple parents instead of multiple children. It will give me something to play around with that might end up being useful at some point.

jelmer commented 4 years ago

On Sat, Apr 25, 2020 at 08:07:41PM -0700, Kevin Hendricks wrote:

Understood. I will play around trying to build a more sophisticated acyclic graph walking approach to detect the best (most recent date) ancestor commit that hopefully will works even when a commit has multiple parents (from previous merges?).

My guess is we can use a recursively built list approach to build a set of all possible parents for each branch and create a set of ancestor commits to choose among. Very similar to building and walking an HTML node tree except with multiple parents instead of multiple children. It will give me something to play around with that might end up being useful at some point.

Yeah, it's a classic lowest common ancestor (https://en.wikipedia.org/wiki/Lowest_common_ancestor) problem.

git-merge-base(1) has some documentation on the approach git itself uses.

-- Jelmer Vernooij jelmer@jelmer.uk PGP Key: https://www.jelmer.uk/D729A457.asc

kevinhendricks commented 4 years ago

Not much detail on the specific algo used by git for merge base but from first glance it appears to be a cleaned up version of the colour and count approach from the first answer to this:

https://stackoverflow.com/questions/14865081/algorithm-to-find-lowest-common-ancestor-in-directed-acyclic-graph

They even appear to use a flags field stored in the commit object itself to store the two colours (one colour to mark ancestor of branch br1 commit, and one colour to mark ancestor of branch br2 commit, plus a count > 0 (staleness) information for dual coloured commits that that are parents of any later dual coloured commits.

I looked at dulwich's commit object in objects.py and noticed an "extra" field. Could this field be mis-used for temporary commit node colouring? Or some other field?

Or would you prefer to simply store commit node colouring in an external hash keyed by commit sha and then not need to clean up after ourselves?

Thoughts?

jelmer commented 4 years ago

On Sun, Apr 26, 2020 at 08:43:11AM -0700, Kevin Hendricks wrote:

Not much detail on the specific algo used by git for merge base but from first glance it appears to be a cleaned up version of the colour and count approach from the first answer to this:

https://stackoverflow.com/questions/14865081/algorithm-to-find-lowest-common-ancestor-in-directed-acyclic-graph

They even appear to use a flags field stored in the commit object itself to store the two colours (one colour to mark ancestor of branch br1 commit, and one colour to mark ancestor of branch br2 commit, plus a count > 0 (staleness) information for dual coloured commits that that are parents of any later dual coloured commits. How would you know when to stop colouring? Would this involve processing the entire history, which can be very large in some cases?

This is the sort of algorithm I was thinking of and that e.g. Bazaar uses: https://stackoverflow.com/a/50387246/291383

I.e. basically do a BFS from both nodes, until you find a common ancestor.

I looked at dulwich's commit object in objects.py and noticed an "extra" field. Could this field be mis-used for temporary commit node colouring? Or some other field?
The extra field is for storing extra attributes in Git commits; modifying it results in the commit's sha changing. The downside of using a field on commit itself would also be that we'd have to keep the full commit in memory.

-- Jelmer Vernooij jelmer@jelmer.uk PGP Key: https://www.jelmer.uk/D729A457.asc

kevinhendricks commented 4 years ago

Yes, they are the same basic approach, just doing both dynamically and stopping when overlap reached. The only issue is if you want all answers or just an answer.

kevinhendricks commented 4 years ago

Okay looking more closely, the flag fields that git uses to mark parentage ("colour") and "stale" properties are not part of the serialized object and therefore do not impact the sha. They are just a field of the commit class (c structure) when instantiated.

So simply changing the Commit class init routine in dulwich to add a flags value that is never serialized, would be an easy way to keep track of flags right in the commit object. Otherwise a hash from commit sha to flags value for every commit visited would work too.

Next week when I get a free moment or two, I will code up a variation of the git/stackexchange approach for you to play around with. It will walk both commits ancestries simultaneously via BFS and colouring and marking parents of lca nodes stale simultaneously. It stops when all nodes in list to process are marked at stale. It will never have to walk the entire ancestry unless there is no common ancestor at all. It will also find all lca values.

jelmer commented 4 years ago

On Sun, Apr 26, 2020 at 01:55:08PM -0700, Kevin Hendricks wrote:

Okay looking more closely, the flag fields that git uses to mark parentage ("colour") and "stale" properties are not part of the serialized object and therefore do not impact the sha. They are just a field of the commit class (c structure) when instantiated.

So simply changing the Commit class init routine in dulwich to add a flags value that is never serialized, would be an easy way to keep track of flags right in the commit object. Otherwise a hash from commit sha to flags value for every commit visited would work too. I'd rather avoid putting this into the Commit class. We could just keep a separate structure (e.g. a set with commit ids in the discovered ancestry, one per colour). There's no reason to keep the full commit objects in memory or to extend the commit interface for merge.

Next week when I get a free moment or two, I will code up a variation of the git/stackexchange approach for you to play around with. It will walk both commits ancestries simultaneously via BFS and colouring and marking parents of lca nodes stale simultaneously. It stops all nodes in list to process are marked at stale. It will never have to walk the entire ancestry unless there is no common ancestor at all. It will also find all lca values. Great, looking forward to it.

Jelmer

-- Jelmer Vernooij jelmer@jelmer.uk PGP Key: https://www.jelmer.uk/D729A457.asc

jelmer commented 4 years ago

Actually, it looks like there is a good merge base implementation here that is MIT licensed:

https://github.com/ywangd/stash/blob/master/lib/git/gitutils.py#L68

Although at a quick glance it seems to be traversing all of the ancestors.

kevinhendricks commented 4 years ago

Yes, it traverses all ancestors of both branches. First with the _collect_ancestors call on sha2 with an empty common set and then when walking all of sha1's ancestors.

It is your call. Do you want to use that one or a more efficient version?

jelmer commented 4 years ago

On Sun, Apr 26, 2020 at 04:34:56PM -0700, Kevin Hendricks wrote:

Yes, it traverses all ancestors of both branches. First with the _collect_ancestors call on sha2 with an empty common set and then when walking all of sha1's ancestors.

It is your call. Do you want to use that one or a more efficient version? I'm not too concerned about getting everything right at the moment, but I do think we should get something that is O(distance-to-common-ancestor) rather than O(size-of-history). The reason being that O(size-of-history) is going to be terrible performance-wise for the vast number of repositories out there.

That code may still be useful as a basis though or to draw inspiration from, which is why I thought it was worth pointing out.

Jelmer

kevinhendricks commented 4 years ago

Sounds good. I will try to put together a more efficient version implementation similar to what git actually does now for you to play around with.

kevinhendricks commented 4 years ago

Here is my first shot. Works on my testcase but needs lots more testing (hopefully later in the week).

https://github.com/Sigil-Ebook/dulwich/commit/466820ae836efe24b04d2b0d6dd035ef342c07c2

kevinhendricks commented 4 years ago

FWIW, what I pushed does the equivalent of merge-base --all, I can easily modify it to return just one, and can support git merge-base is-ancestor-of with just slight tweaks. I think that can be used to determine if fast forward is possible?

jelmer commented 4 years ago

Looks good :) You may want to just pass in a "lookup_parents" function to the main function, so that you can easily test it by creating a dictionary with strings in to test, and pass in something that looks up the parents from a Repository otherwise.

i.e. for tests:

parents = {'a': ['b', 'c'], 'b': [], 'c': []}
_find_lcas(parents.__getitem__, ['a', 'b'])

and then for "real code":

def lookup_parents(commit_id):
     return r.object_store[commit_id].parents

_find_lcas(lookup_parents, commit_ids)
kevinhendricks commented 4 years ago

Yes, after a few tweaks to my porcelain.py and merge_base.py code to pass in and use commit ids vs commit objects, your idea of passing in a lookup_parent function works quite well and makes the code itself fully standalone, which is nice. Especially for testing.

Once I get time later this week to run a bunch of tests, I will push the latest version to Sigil's dulwich repo.

Will you be pushing more of your local changes to merge-1 so that I can try to keep my code aligned with yours.

Thanks

kevinhendricks commented 4 years ago

Okay made merge_base.py run some tests internally based on git doc examples and it passed all of those tests. We can extract those into a separate test code base once accepted.

Tweaked merge.py to use the new merge_base.py instead of subprocess and git. Tweaked porcelain.py to add merge_base --all support and created a merge_base_is_ancestor routine as well.

Pushed all of this to Sigil-Ebook/dulwich master.

Checked out how git's merge_base --octopus works, and it appears to take the first two and find their lca, then loop through the remaining ones and using find_lcas on each with the current lca as the other input. Thereby finding the overall single lca for the entire octopus.

So adding support for the merge-base octopus should be very straight-forward given what we have now.

jelmer commented 4 years ago

On Tue, Apr 28, 2020 at 11:39:09AM -0700, Kevin Hendricks wrote:

Okay made merge_base.py run some tests internally based on git doc examples and it passed all of those tests. We can extract those into a separate test code base once accepted.

Tweaked merge.py to use the new merge_base.py instead of subprocess and git. Tweaked porcelain.py to add merge_base --all support and created a merge_base_is_ancestor routine as well.

Pushed all of this to Sigil-Ebook/dulwich master.

Checked out how git's merge_base --octopus works, and it appears to take the first two and find their lca, then loop through the remaining ones and using find_lcas on each with the current lca as the other input. Thereby finding the overall single lca for the entire octopus.

So adding support for the merge-base octopus should be very straight-forward given what we have now. Nice! I'll have a look at merging in support for merge-base later this week or this weekend.

kevinhendricks commented 4 years ago

Had some free time this afternoon and added find_octopus_base routine to merge_base.py and added a test from git docs for it. Tweaked a few things and cleaned things up. Should be ready for you to take a peak at this weekend.

jelmer commented 4 years ago

On Wed, Apr 29, 2020 at 03:07:36PM -0700, Kevin Hendricks wrote:

Had some free time this afternoon and added find_octopus_base routine to merge_base.py and added a test from git docs for it. Tweaked a few things and cleaned things up. Should be ready for you to take a peak at this weekend. Sorry, had some other things come up but I haven't forgotten about this. Currently plan is to spend some time on it during the upcoming long weekend.

kevinhendricks commented 4 years ago

Sounds good. Let me know if there is something else related to merge you want me to look at that would help and not interfere with what you are doing.

If not, I might take a look at implementing "git diff" that compares unstaged and untracked changes in the working directory to head. It is something I use a lot and from looking at the code dulwich current diff seems to be only for comparing trees, and not for comparing the working directory files to a tree (unless I missed something - which is quite possible ;) Would that be worthwhile?

Can temporary trees blobs be made that are either never added to the object store or possibly added and then removed and without an associated commit? If so, that might be a way to approach it to maximize code reuse for "git diff".

jelmer commented 4 years ago

On Tue, May 05, 2020 at 04:22:57PM -0700, Kevin Hendricks wrote:

Sounds good. Let me know if there is something else related to merge you want me to look at that would help and not interfere with what you are doing. I've done some work this weekend adding tests for merge. My hope is to add some more during the next week and to get the interface into a shape where it actually updates the working tree.

Any help to get your code into a shape that matches CONTRIBUTING.rst would be a massive help. I'm working to do this for merge-base at the moment, but I don't think I'll have time to do this for your merge3 changes any time soon.

Was there a particular reason you picked MIT for your changes rather than the dual-licensed Apachev2/GPLv2 that Dulwich uses by default? Are you happy for me to use your changes under those licenses? I can include code under a different license, but it's going to make the licensing more complicated and I'd still have to include the MIT notices in e.g. the license text.

If not, I might take a look at implementing "git diff" that compares unstaged and untracked changes in the working directory to head. It is something I use a lot and from looking at the code dulwich current diff seems to be only for comparing trees, and not for comparing the working directory files to a tree (unless I missed something - which is quite possible ;) Would that be worthwhile?

Can temporary trees blobs be made that are either never added to the object store or possibly added and then removed and without an associated commit? If so, that might be a way to approach it to maximize code reuse for "git diff". You can either create blobs in memory or store them in a MemoryObjectStore which purely exists in memory. I would usually just put them in the local object store, for simplicity.

Longer term, it may be nice to make the interface for write_object_diff() a bit more generic, e.g. to take two sets of iterators, so that you don't need to keep the entire working tree into memory and can implement those iterators on both Index and ObjectStore. Index.iterobjects() would probably be a good example of such an iterator.

Jelmer

jelmer commented 4 years ago

I have pushed some changes to the merge-base branch.

kevinhendricks commented 4 years ago

All of my code in Sigil's dulwich fork should now pass flake8, and the public functions/methods should all have google docstrings. The only thing missing are separate tests and those already exist for merge_base.py.

As for licenses, anything based on MIT licensed code, still is. I do like the MIT license for anything I write from scratch for other projects as it allows the maximum number of different open source projects to re-use the code.

As for changes/fixes in existing dulwich code modules (not separate modules written by me) you can simply assume ownership of code and license as you would if I presented them as pull requests.

Your dual licensed project should have no issues including MIT licensed code modules as it is more permissive than either of your licenses (and with no advertising clauses!).

I already implemented the major functions of "git diff" in Sigil's version of porcelain.py. It will handle diffs from commit to working directory, diffs from commit to index, diffs from index to working directory, and it subsumes your diff-tree to perform diffs between commits.

While implementing this I found and fixed a number of issues with regards to your walk_working_directory code as the existing code will not handle paths as bytes properly (searches for '.git' and not b'git' if the directory to walk is provided as a bytes). Python3 os.walk will return bytes if passed a bytes path and str if passed a str path. Also found similar bugs in ignore.py in the IgnoreManager trying to split on '/' on a bytes and not b'/'.

Also just encoding and decoding use getfilesystemencoding() is probably not really the recommended approach under python3 post 3.5. Using the existing os.fsencode and os.fsdecode might make more sense as it uses platform dependent error handling and on mac/linux 'surrogateescape' and on Windows uses strict.

The diff code does not know the encoding of any file contents (or paths) and therefore must use bytes for both so handling paths as bytes is needed.

If I get free time later this week, I will take a look at your test code and try to add some tests and try to add testcases for some of the bytes paths related issues I found.

Hope this helps.

jelmer commented 4 years ago

On Sun, 2020-05-10 at 15:37 -0700, Kevin Hendricks wrote:

All of my code in Sigil's dulwich fork should now pass flake8, and the public functions/methods should all have google docstrings. The only thing missing are separate tests and those already exist for merge_base.py. Thanks!

As for licenses, anything based on MIT licensed code, still is. I do like the MIT license for anything I write from scratch for other projects as it allows the maximum number of different open source projects to re-use the code.

As for changes/fixes in existing dulwich code modules (not separate modules written by me) you can simply assume ownership of code and license as you would if I presented them as pull requests.

Your dual licensed project should have no issues including MIT licensed code modules as it is more permissive than either of your licenses (and with no advertising clauses!). Even if MIT doesn't have an advertising clause, having code under mixed licenses is going to make other things harder, e.g. if I wanted to move some code from merge_base.py to merge.py I now have to copy the MIT license text along with it (rather than just adding a copyright line), since the license requires this:

"The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software."

Note that I'm not talking about attribution or ownership of the code, but about being able to use it under Apachev2/GPLv2 rather than MIT. Dulwich has a large number of copyright holders, those with substantial contributions you can find by grepping for "Copyright" in the code.

So while I appreciate that that is unavoidable for e.g. the myers implementation (since it's third party code), this is why CONTRIBUTING talks about specific licenses.

I already implemented the major functions of "git diff" in Sigil's version of porcelain.py. It will handle diffs from commit to working directory, diffs from commit to index, diffs from index to working directory, and it subsumes your diff-tree to perform diffs between commits.

While implementing this I found and fixed a number of issues with regards to your walk_working_directory code as the existing code will not handle paths as bytes properly (searches for '.git' and not b'git' if the directory to walk is provided as a bytes). Python3 os.walk will return bytes if passed a bytes path and str if passed a str path. Also found similar bugs in ignore.py in the IgnoreManager trying to split on '/' on a bytes and not b'/'. FWIW that's intentional - both IgnoreFilterManager and walk_working_directory are meant to be used with plain strings. Adding support for bytes (or even Path objects) to them seems like a worthwile change, but their current intended use is with regular file system paths. Is there a particular reason why you can't decode the diff path

Also just encoding and decoding use getfilesystemencoding() is probably not really the recommended approach under python3 post 3.5. Using the existing os.fsencode and os.fsdecode might make more sense as it uses platform dependent error handling and on mac/linux 'surrogateescape' and on Windows uses strict. Dulwich is still compatible with python 2.7, so we can't use os.fsencode/os.fsdecode just yet.

The diff code does not know the encoding of any file contents (or paths) and therefore must use bytes for both so handling paths as bytes is needed. Why is decoding them using the fsencoding not an option? I appreciate that this would mean decoding and re-encoding on unix, but that doesn't seem like the end of the world (since we'd be using the same encoding).

If I get free time later this week, I will take a look at your test code and try to add some tests and try to add testcases for some of the bytes paths related issues I found. Thanks!

Jelmer

kevinhendricks commented 4 years ago

So for merge_base.py and diff3merge.py please go ahead and change their license to your dual license. I will move MIT licensed versions of those files to a repo on my personal github account so that the MIT license versions are still available for anyone. I will add in the meyers diff code and the diffstat code as well there.

kevinhendricks commented 4 years ago

FWIW, I am very surprised you still want to support python 2.7 in any new code. Sigil stopped supporting Python 2.7 code over two or 3 years ago and removed all support for python 2.7 plugins from Sigil.

Without os.fsdecode and os.fsencode and without the python 3.x surrogateescape features, you can end up with a bytes path from a linux box that can NOT be represented as a unicode string object. And of course on Windows you can end up with getfilesystemencoding returning mbcs and replacing all unknown characters with ?.

So under Python 2.7 there is no version that will work correctly for all platforms. After Python 3.something, they set utf-8 as the Windows default getfilesystemencoding so that bytes paths can be used properly on Windows. And they added the surrogateescape error handler to allow linux raw byte paths to be represented as "unicode" strings.

The issue is all dulwich stores seem to be keyed to "tree_path"s which are a repo relative posixpaths in bytes. These have to be created from user supplied Windows, Linux, and mac paths.

So how exactly do you want to handle those conversions for each platform that will actually work properly across Python 2.7 (with its silly 'mbcs' default getfilesystemencoding for Windows) and still allow all valid byte patterns that are possible for linux paths (so no 'utf-8' encode/decode with strict or replace error handlers)?

jelmer commented 4 years ago

On Sun, May 10, 2020 at 08:42:02PM -0700, Kevin Hendricks wrote:

So for merge_base.py and diff3merge.py please go ahead and change their license to your dual license. I will move MIT licensed versions of those files to a repo on my personal github account so that the MIT license versions are still available for anyone. I will add in the meyers diff code and the diffstat code as well there. Thanks!

jelmer commented 4 years ago

On Mon, May 11, 2020 at 05:25:29AM -0700, Kevin Hendricks wrote:

FWIW, I am very surprised you still want to support python 2.7 in any new code. Sigil stopped supporting Python 2.7 code over two or 3 years ago and removed all support for python 2.7 plugins from Sigil. I'm keen to drop Python 2 support, but there are at least two applications that uses Dulwich that still support Python 2 - hg-git and Launchpad. Launchpad could probably stick to an older version of Dulwich, but Mercurial has only just put its first Python 3-compatible release out IIRC.

There is a python3-only branch in Dulwich that drops Python 2 support and has a number of other simplifications.

Without os.fsdecode and os.fsencode and without the python 3.x surrogateescape features, you can end up with a bytes path from a linux box that can NOT be represented as a unicode string object. And of course on Windows you can end up with getfilesystemencoding returning mbcs and replacing all unknown characters with ?.

So under Python 2.7 there is no version that will work correctly for all platforms. After Python 3.something, they set utf-8 as the Windows default getfilesystemencoding so that bytes paths can be used properly on Windows. And they added the surrogateescape error handler to allow linux raw byte paths to be represented as "unicode" strings.

The issue is all dulwich stores seem to be keyed to "tree_path"s which are a repo relative posixpaths in bytes. These have to be created from user supplied Windows, Linux, and mac paths.

So how exactly do you want to handle those conversions for each platform that will actually work properly across Python 2.7 (with its silly 'mbcs' default getfilesystemencoding for Windows) and still allow all valid byte patterns that are possible for linux paths (so no 'utf-8' encode/decode with strict or replace error handlers)? Agreed on all accounts - it would be nice to use surrogateescape, and we should do this once we drop Python 2 support. Patches for moving the python3-only branch to fsencode/fsdecode would be great.

Jelmer

kevinhendricks commented 4 years ago

FYI, I just today modified diff3merge and myersdiff.py to fix all flake8 remaining issues, and to put back python 2.7 support and pushed them earlier today to Sigil-Ebook/dulwich master.

So you may want to grab current versions from there.

Changes were made to ignore.py to handle both bytestrings and str. Changes were also made to patch.py to support the various versions of “git diff” and of course to porcelain.py for the same reason.

Feel free to grab any pieces you might be interested in and assume ownership of them.

I am going to try and generate standalone test cases for diff3merge and related code by studying your test_ code when I get more free time later this week.

Sent from my iPhone

On May 11, 2020, at 6:02 PM, Jelmer Vernooij notifications@github.com wrote:

 On Mon, May 11, 2020 at 05:25:29AM -0700, Kevin Hendricks wrote:

FWIW, I am very surprised you still want to support python 2.7 in any new code. Sigil stopped supporting Python 2.7 code over two or 3 years ago and removed all support for python 2.7 plugins from Sigil. I'm keen to drop Python 2 support, but there are at least two applications that uses Dulwich that still support Python 2 - hg-git and Launchpad. Launchpad could probably stick to an older version of Dulwich, but Mercurial has only just put its first Python 3-compatible release out IIRC.

There is a python3-only branch in Dulwich that drops Python 2 support and has a number of other simplifications.

Without os.fsdecode and os.fsencode and without the python 3.x surrogateescape features, you can end up with a bytes path from a linux box that can NOT be represented as a unicode string object. And of course on Windows you can end up with getfilesystemencoding returning mbcs and replacing all unknown characters with ?.

So under Python 2.7 there is no version that will work correctly for all platforms. After Python 3.something, they set utf-8 as the Windows default getfilesystemencoding so that bytes paths can be used properly on Windows. And they added the surrogateescape error handler to allow linux raw byte paths to be represented as "unicode" strings.

The issue is all dulwich stores seem to be keyed to "tree_path"s which are a repo relative posixpaths in bytes. These have to be created from user supplied Windows, Linux, and mac paths.

So how exactly do you want to handle those conversions for each platform that will actually work properly across Python 2.7 (with its silly 'mbcs' default getfilesystemencoding for Windows) and still allow all valid byte patterns that are possible for linux paths (so no 'utf-8' encode/decode with strict or replace error handlers)? Agreed on all accounts - it would be nice to use surrogateescape, and we should do this once we drop Python 2 support. Patches for moving the python3-only branch to fsencode/fsdecode would be great.

Jelmer — You are receiving this because you modified the open/close state. Reply to this email directly, view it on GitHub, or unsubscribe.

kevinhendricks commented 4 years ago

Because of all of the issues with path conversions across python 2.7 and python 3.x under linux, mac, and windows, I have created an implementation of the equivalent of os.fsencode and os.fsdecode (and surrogateescape for decode) for python2.7 that will invoke invoke os.fsencode and os.fsdecode under python 3.X.

See fsutils.py (in Sigil-Ebook/dulwich master) which provides the following functions under both python 2.7 and 3.X:

bytes_path(apath) - which os.fsencode or its equivalent under python 2.7

unicode_path(apath) - which does os.fsdecode or its equivalent under python 2.7

printable_path(apath) - which will prevent errors when printing unicode paths with surrogates

I put it under an MIT license for now but if you decide you would like to go with this approach in dulwich, you are free to use it under dulwich's dual license scheme.

I would very much like to adopt this code for use in the implementation of "git diff" when converting paths from the working directory walk to tree_paths needed for object_store keys to for the index and for commit trees entry manipulations.

It could also be used in a number of places throughout dulwich to implement os.fsencode and os.fsdecode now without giving up support for python 2.7 (until you want to do that).

Anyway, give it a look-see and let me know if this is something you think you might be interested in.