kubernetes / git-sync

A sidecar app which clones a git repo and keeps it in sync with the upstream.
Apache License 2.0
2.22k stars 411 forks source link

Add possibility to control worktree lock with a flag #832

Closed bakome closed 10 months ago

bakome commented 11 months ago

Proposal

In some cases GitSync is used with portable or NFS volumes, which by nature can be mounted and unmounted multiple times during lifetime of application.

Git documentation states in this case when using NFS or portable volume to lock the worktree.

If the working tree for a linked worktree is stored on a portable device or network share which is not always mounted, you can prevent its administrative files from being pruned by issuing the git worktree lock command, optionally specifying --reason to explain why the worktree is locked.

This feature will prevent cases explained in #827

bakome commented 11 months ago

I will prepare PR for this implementation.

thockin commented 11 months ago

Reading docs for this it sounds like it's probably not wrong, but it isn't going to do much - the worktrees are always in the same volume as the repo itself, so NFS isn't going to be improved.

To handle the case of something removing the worktree while it is in use, I think the best we can do is handle it and reconcile it back into existence, but in that case something is actively messing things up, and it will be hard to guarantee correctness.

E.g. if someone deletes a file from the local synced repo, we won't reconcile that.

So I am open to generally behaving better, but I want to limit the complexity for obvious cases of "don't do that", because we will not be perfect in the face of active chaos.

On Fri, Oct 13, 2023, 8:34 AM Aleksandar Markovski @.***> wrote:

I will prepare PR for this implementation.

— Reply to this email directly, view it on GitHub https://github.com/kubernetes/git-sync/issues/832#issuecomment-1761714901, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKWAVGGLZWDUL76BS3VYSLX7FNQFAVCNFSM6AAAAAA57KPEX6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONRRG4YTIOJQGE . You are receiving this because you are subscribed to this thread.Message ID: @.***>

bakome commented 11 months ago

I understand that they are in same volume, but that do not appear to work. Somehow Git do not handle that well, and they are deleted which lead to the bad loop. Just to mention I am not speaking about local synced repositories, I was referring to removal of complete volume of ./worktree directory.

To be more specific this is a case example when this happens:

As you can see nothing is changed manually, but somehow I assume that Git mess up the worktrees. Just to mention i was able to repeat this multiple times on Kubernetes and with Docker Compose.

thockin commented 11 months ago

I am not speaking about local synced repositories, I was referring to removal of complete volume of ./worktree directory

One of four things happened:

1) the underlying volume is blinking, which means the root/.git directory is also gone (so locking the worktree won't matter)

2) someone actively deleted the worktrees (so locking won't matter); this was my above comment - we can try to recover but I don't think super-heroic measures are justified

3) a bug in git-sync caused (2), which is why "try to recover" is acceptable to me :)

4) You have mounted a different volume at $root/.worktrees, which seems like a terrible idea and is abusing an implementation detail.

somehow I assume that Git mess up the worktrees

This is what I'd like to understand - I would HOPE that git would hang until NFS comes back, and then proceeed. And if that is not true, I would hope that there would be an error. And if that is not true, I would hope that the next sync-period's sanity-check phase would fail, and we would start over. And if THAT is not true, I don't know. It means the worktree exists and passes fsck but is somehow corrupted? How can we detect that?

I'm not saying I don't believe you, just that I don't want to add random git operations unless I know what they do and why they matter.

bakome commented 11 months ago

Yes I agree, don't worry I respect good debate. With more debug I believe that probably you are right about the worktree, but I will do more research to be sure on that. What I am 100% sure until know is that lost of the volume cause git command to reach "context deadline exceeded" which is probably cause of this. However as I mention I will spend little more time to find the root of the issue.

What also I am sure of 100% is that the fix #828 fix this issue, but is not addressing the root cause.

thockin commented 11 months ago

Ahh, "context deadline exceeded" makes sense - there is a timeout, and if that runs out, we just abort the sync. That could leave things inconsistent, but the sanity check at the beginning of each sync look is SUPPOSED to catch this. I can't unwind every change, but I can nuke it and start over. :)

Your other patch was a great find, and I have added another testcase to cover the same failure in a different way. Now that this is merged, I wonder how much more we can/should handle. If there's a corruption case you can find that leaves an incorrect repo but still passes the sanity-check, THAT is something I want to fix.

I have a couple ideas of things to test (like ensuring that git rev-parse HEAD in the worktree gives the right result and maybe even checking for a "dirty" repo (with uncommitted changes), but I bet that breaks someone, so I doubt we can do it.

bakome commented 11 months ago

Yes, the patch fix the issue, but do not prevent it. I will try to find it in details and make a new patch. Probable we can close this issue or we can keep just for reference.

bakome commented 11 months ago

@thockin I think i figure it our how this happen. I think that the main cause is sanityCheckWorktree, but I believe that the same can happen with Repo Sanity Check too, however that cannot lead to this error because git init from start should be good. The things is in worktree sanity check git fsck took more time to finish and we end with "context deadline exceeded" which probably return false on that function. Even if we remove git fsck probably dir empty check will return false which will lead to main cause of this problem.

git.log.V(0).Info("worktree failed checks or was empty", "path", currentWorktree)
if err := git.removeWorktree(ctx, currentWorktree); err != nil {
    return false, "", err
}
currentHash = ""

The thing is that in removeWorktree the first check is

case os.IsNotExist(err):
        return nil

So in previous check we see if err is different of nil and we did not return false on sync and lead to setting up the current hash to empty string. And after that the loop start to happen.

And after this the loop every time when is doing the sanity check of repo it fails because directory was missing. However I again mention that #828 fix the issue, but I believe if I change removeWorktree function to return err when worktree path is not found it will prevent this to not happen even without #828 fix.

@thockin One question just to consider, what do u think if is good sanityCheckWorktree and sanityCheckRepo to have mechanism and fail the SyncRepo current run if "context deadline exceed" happen?

thockin commented 11 months ago

I think #828 is a correct change, anyway.

In case of a deadline we should exit as cleanly as possible, so if there are paths where that can be better, that sounds ok.

johnklehm commented 11 months ago

We're definitely hitting these issues, what's your feel for the timing of a 4.x release with this included?

thockin commented 11 months ago

I'll try to make time ASAP for a release. I think it will be 4.1.0 because of the --ssh flag change.

On Sat, Oct 14, 2023 at 12:04 PM C John Klehm @.***> wrote:

We're definitely hitting these issues, what's your feel for the timing of a 4.x release with this included?

— Reply to this email directly, view it on GitHub https://github.com/kubernetes/git-sync/issues/832#issuecomment-1763128623, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABKWAVBG6XTR7IS3FFZ7Z6DX7LO3RAVCNFSM6AAAAAA57KPEX6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTONRTGEZDQNRSGM . You are receiving this because you were mentioned.Message ID: @.***>

thockin commented 11 months ago

https://github.com/kubernetes/git-sync/releases/tag/v4.1.0 is currently in the promotion queue - needs an approval and then it will be published.

johnklehm commented 11 months ago

Appreciate you sir!