jelmer / dulwich

Pure-Python Git implementation
https://www.dulwich.io/
Other
2.06k stars 395 forks source link

stuck on can_fast_forward when trying to porcelain.pull #910

Open SaeeSaadat opened 2 years ago

SaeeSaadat commented 2 years ago

I clone my repository (with https), then create all the remote branches locally. and then: remote_refs = git_client.fetch(git_remote_address.encode(), repo) porcelain.update_head(repo, target=branch_name, detached=False, new_branch=None) porcelain.pull(repo, git_remote_address, refspecs=f"refs/heads/{branch_name}".encode(), username=git_username, password=git_password)

it just gets stuck. i traced it and found out that it get's stuck in this loop in graph.py while _has_candidates(wlst, cstates): cmt = wlst.popleft() flags = cstates[cmt] if flags == (_ANC_OF_1 | _ANC_OF_2): ... after a while the 'cmt' and the 'wlst' (great naming btw) just won't change! am i doing something wrong or is this a bug? (if i just skip the can_fast_forward and return True immediately it works fine but it's not an option in production!)

jelmer commented 2 years ago

Which version of dulwich is this?

SaeeSaadat commented 2 years ago

Which version of dulwich is this?

dulwich==0.20.26

SaeeSaadat commented 2 years ago

i just tried version 0.19.7 and it works fine there! whatever the issue is has been added since 0.19.7

jelmer commented 2 years ago

Does it get stuck or is it just slow? Do you have a traceback?

We previously didn't actually check whether history could be fast-forwarded previously, we just overwrote head - so I don't think this is a regression.

SaeeSaadat commented 2 years ago

no i'm pretty sure it got stuck in the while loop! i checked the conditions as i said and it never changed!

Does it get stuck or is it just slow? Do you have a traceback?

We previously didn't actually check whether history could be fast-forwarded previously, we just overwrote head - so I don't think this is a regression.

PRyzhov commented 2 years ago

Fixed this behaviour with this aproach:

    remote = fetch(repo)
    repo[b'HEAD'] = remote.refs[b'HEAD']
    pull(repo)
jelmer commented 2 years ago

That's not really a fix - you're just skipping the divergence checker, which is where this bug appears.

PRyzhov commented 2 years ago

That's not really a fix - you're just skipping the divergence checker, which is where this bug appears.

It's true - it's a workaround :) In my case bug appears when the local repository has its own tree and needs to fast-forward.

kevinhendricks commented 1 year ago

I have tracked this one down to the implementation of _find_lcas in graph.py not being as efficient as it can be. The previous version does find the proper lcas but is much much slower when faced with larger repos with lots of commits.

The key changes are in how the parents of a commit are handled when that parent commit has already been visited by the same descendant. Similar logic exists in libgit2's merge paint_down routine.

With these changes dulwich passes all the same tests this routine did before.

Here is the much much more efficient version of that routine that properly keeps the wlst small.

def _find_lcas(lookup_parents, c1, c2s):
    cands = []
    cstates = {}

    # Flags to Record State                                                                                                    
    _ANC_OF_1 = 1  # ancestor of commit 1                                                                                      
    _ANC_OF_2 = 2  # ancestor of commit 2                                                                                      
    _DNC = 4  # Do Not Consider                                                                                                
    _LCA = 8  # potential LCA                                                                                                  

    def _has_candidates(wlst, cstates):
        for cmt in wlst:
            if cmt in cstates:
                if not ((cstates[cmt] & _DNC) == _DNC):
                    return True
        return False

    # initialize the working list                                                                                              
    wlst = deque()
    cstates[c1] = _ANC_OF_1
    wlst.append(c1)
    for c2 in c2s:
        cstates[c2] = _ANC_OF_2
        wlst.append(c2)

    # loop until no other LCA candidates are viable in working list                                                            
    # adding any parents to the list in a breadth first manner                                                                 
    while _has_candidates(wlst, cstates):
        cmt = wlst.popleft()
        cflags = cstates[cmt]
        if cflags == (_ANC_OF_1 | _ANC_OF_2):
            # potential common ancestor                                                                                        
            if not ((cflags & _LCA) == _LCA):
                cflags = cflags | _LCA
                cstates[cmt] = cflags
                cands.append(cmt)
                # mark any parents of this node _DNC as all parents             
                # would be one level further removed common ancestors                                                          
                cflags = cflags | _DNC
        parents = lookup_parents(cmt)
        if parents:
            for pcmt in parents:
                pflags = 0
                if pcmt in cstates:
                    pflags = cstates[pcmt]
                else:
                    cstates[pcmt] = pflags
                # if cmt already visited with no new flag information                                                          
                # do not add it to the working list of possible candidates                                                     
                if ((pflags & cflags) == cflags):
                    continue
                cstates[pcmt] = cstates[pcmt] | cflags
                wlst.append(pcmt)

    # walk final candidates removing any superseded by _DNC by later lower LCAs                                                
    results = []
    for cmt in cands:
        if not ((cstates[cmt] & _DNC) == _DNC):
            results.append(cmt)
    return results

Hope this helps.

jelmer commented 1 year ago

Can you propose this as a PR?

kevinhendricks commented 1 year ago

Okay but a copy and paste of those few lines works just as easily. I will be back home on Monday evening and will create a PR. All I have access to this weekend is an ipad.

kevinhendricks commented 1 year ago

BTW your typing info for DEQUE was INT but I was unsure if that was correct. What is the base type of a commit? Isn't it a byte string of some sort? Is it okay to say it is a deque of INT?

jelmer commented 1 year ago

Yeah, you're right - that looks wrong. It should probably be Deque[bytes]

kevinhendricks commented 1 year ago

I will include that in my PR tomorrow evening. Today I compared original git logic to libgit2 logic for paint to common to find LCAs. I will incorporate some logic from those versions in the PR as well to hopefully make it more robust. I am sure they must have found and handled a number of non-obvious corner cases by now that my painting approach would be inefficient at handling.

kevinhendricks commented 1 year ago

Okay, the PR is made. Hope this helps.

DangerMouseB commented 1 year ago

Yes it definitely helps. It's still a bit slow on large repos. I've not thought about the algo too much so I don't know if it can be made even faster.

I have a commit that is 9 commits in a straight line behind master - if not (cstates[cmt] & _LCA) == _LCA: detects it okay but then it continues back through the whole of history. In my case cstates ends up with > 32k items in it.

8.4 seconds in all to do a pull. Much faster than the prior implementation. Mind you gitkraken isn't quick either.

kevinhendricks commented 1 year ago

Did you test with the above or that final version that the PR pulled in? It has a few improvements over the code piece I posted above.

FWIW, real git and libgit2 have an extra escape mechanism that basically tells this routine to stop after searching backwards for n generations. We could easily add something like that and get the results much faster for large repos where an intelligent limit could be set (ie. do not go back more than 100 generations of commits).

Or the cutoff could be done by commit time since it already exists in each commit.

DangerMouseB commented 1 year ago

I use the code from your merged PR.

Is the problem described anywhere? I may not be thinking straight but if we start off with two commits and find a node that joins the two branches (i.e. chain of parents) of those two commits then are we not done?

The time based search sounds a good avenue, if I've understood what you're saying, i.e. keep searching up the parents of the branch with the latest commit time - since there is no point in searching earlier points in time until the most recent branch has caught up.

Also, although technically more complex (because of adding an extra inner loop to select the most recent branch), presumably retrieving a commit from a repo is slow enough by comparison that, the time saved by not needing to retrieve more commits from the repo than absolutely necessary may / should make the search faster overall.

kevinhendricks commented 1 year ago

The latest real git merge techniques (recursive and ort) use recursive merges of all lowest common ancestors of both commits to create much better merges (much fewer conflicts). This requires finding all the lowest common ancestors (lcas) that are not stale (ie. not the parent of another lca).

The older git merge strategy (resolve) just requires a single lowest common ancestor to work but it should be the lowest.

So merging requires a full blown lcas search.

But can fast forward uses lcas in another way.

To determine if you can fast forward really means to verify if there is just one lca (one is a direct descendant of the other in a linear sequence of commits).

Assuming you are trying to pull in a newer version from remote, once you are earlier in commit date than than the local repo head, you should be able to stop the lcas search process as all that matters is a single straight line from the remote commit to your local commit (assuming your local has the older version of the files).

So one way to greatly speed things up to determine if fast forward is possible is to stop the lcas search once you start examining parents that are older than your local head commit date.

That is what I meant by using commit dates to know when to stop. Real git and libgit2 use this mechanism by examining what they call "generations" which are effectively commit dates to stop the search early.

I will look into adding this to the dulwich find_lcas routine, but that will take work since we are just using pre-built commit parent info and the commit sha in the search but would need to access the Commit object itself to get the commit timestamp info.

So prebuilding those somehow would be a big speedup.

Hope this explains things a bit better.

kevinhendricks commented 1 year ago

Okay, I studied the git logic and libgit2 logic for their lcas finding routines and they use a priority queue (not a fifo) and they base their priority on the max (newest) generation (which I have no idea what a "generation" really is but more importantly it falls back to commit time). The maximum commit time would be the latest commit. That is what gets processed next which would certainly be a huge improvement in performance for "can fast forward".

This means we really need to cache both commit parents and commit time alongside every commit id (sha) otherwise we will be potentially opening up and parsing huge numbers of commit files on the filesystem repeatedly every time this function is used.

Walking the tree once to build this cache would certainly help.

Right now I will switch the code to to use a priority queue and we can worry about getting commit times cached later and I will add the minimum commit time cutoff for adding to the list of commits to be examined.

Caching can come later.

kevinhendricks commented 1 year ago

Okay, as proof of concept I pushed a simple new implementation for find_lcas and can_fast_forward.

I have attached a new graph.py here.

Please give it a try, unless I am all messed up this should hugely speed things up for your large repo pull.

DangerMouseB commented 1 year ago

When you said "that will take work" I was anticipating a wait of months, and maybe having a crack at it myself. So thank you very much - hope you enjoyed doing the optimisation.

The priority queue based on timestamp sounds exactly the right sort of thing to do :)

I'll test it out by the end of next week (hopefully sooner).

kevinhendricks commented 1 year ago

Actually, I am just scratching my own itch so to speak. My project Sigil needs the ability to handle merges across different branches in dulwich. So I have created an add-on that gives that capability to dulwich. It needs a good routine to find lcas as well. So what helps you helps me.

If interested check out:

https://github.com/Sigil-EBook/dulwich_merge

That code is now passing the recursive merge libgit2 tests and should be useable. But more testing is always welcome.

kevinhendricks commented 1 year ago

Found a bug specific to can fast forward and fixed it so please try the following instead:

graph.py.zip

DangerMouseB commented 1 year ago

It goes faster and seems to work thus far. Will let you know if I come across any issues.

kevinhendricks commented 1 year ago

Thanks for letting me know it helped. A more advanced version is now part of my dulwich_merge project. Still waiting to hear if jelmer will merge any of that.

jelmer commented 1 year ago

Can you propose the LCA fixes as a PR? Either way they should be separate enough from the merge supported and landed independently.

kevinhendricks commented 1 year ago

Will do so tomorrow.