fix: handle errors from credential refresh

karlkfi commented 12 months ago

Previously, errors from askpass and credential storage were being ignored, causing git clone/fetch to later error with hard-to-read errors.

k8s-ci-robot commented 12 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: karlkfi Once this PR has been reviewed and has the lgtm label, please assign janetkuo for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files: - **[OWNERS](https://github.com/kubernetes/git-sync/blob/master/OWNERS)** Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment

thockin commented 12 months ago

This has an interesting semantic change. Consider this: the credentials function succeeds once and gives a token valid for 5 minutes. 4 minutes later it fails one time. We still have a valid credential, but git-sync doesn't KNOW that.

So should we hard-fail as this PR does, or log an error and try to sync anyway? Or maybe we should track "have I EVER had a valid credential? If not fail"? Or something else?

thockin commented 12 months ago

What about something like the below? Can you test it?

diff --git a/main.go b/main.go
index 3a562bc..cb208c4 100644
--- a/main.go
+++ b/main.go
@@ -936,11 +936,12 @@ func main() {
    }

    // Craft a function that can be called to refresh credentials when needed.
-   refreshCreds := func(ctx context.Context) error {
+   everHadCreds := false
+   refreshCreds := func(ctx context.Context) (bool, error) {
        // These should all be mutually-exclusive configs.
        if *flUsername != "" {
            if err := git.StoreCredentials(ctx, *flUsername, *flPassword); err != nil {
-               return err
+               return everHadCreds, err
            }
        }
        if *flAskPassURL != "" {
@@ -948,11 +949,12 @@ func main() {
            // re-fetched each time.
            if err := git.CallAskPassURL(ctx); err != nil {
                metricAskpassCount.WithLabelValues(metricKeyError).Inc()
-               return err
+               return everHadCreds, err
            }
            metricAskpassCount.WithLabelValues(metricKeySuccess).Inc()
        }
-       return nil
+       everHadCreds = true
+       return everHadCreds, nil
    }

    failCount := 0
@@ -1697,10 +1699,15 @@ func (git *repoSync) currentWorktree() (worktree, error) {
 // SyncRepo syncs the repository to the desired ref, publishes it via the link,
 // and tries to clean up any detritus.  This function returns whether the
 // current hash has changed and what the new hash is.
-func (git *repoSync) SyncRepo(ctx context.Context, refreshCreds func(context.Context) error) (bool, string, error) {
+func (git *repoSync) SyncRepo(ctx context.Context, refreshCreds func(context.Context) (bool, error)) (bool, string, error) {
    git.log.V(3).Info("syncing", "repo", git.repo)

-   refreshCreds(ctx)
+   if everHadCreds, err := refreshCreds(ctx); err != nil {
+       if !everHadCreds {
+           return false, "", fmt.Errorf("credential refresh failed: %w", err)
+       }
+       git.log.V(0).Info("credential refresh failed but older credentials may work, attempting to sync")
+   }

    // Initialize the repo directory if needed.
    if err := git.initRepo(ctx); err != nil {

thockin commented 11 months ago

I have patched this into v4 and written tests that tickle the issue. It's not that this converted a fail to a success, but a bad-error fail into a good-error fail.

Do you want to run with it, or shall I?

diff --git a/main.go b/main.go
index 8ddb7e9..b8664d5 100644
--- a/main.go
+++ b/main.go
@@ -936,23 +936,27 @@ func main() {
    }

    // Craft a function that can be called to refresh credentials when needed.
-   refreshCreds := func(ctx context.Context) error {
+   var lastCreds time.Time // .IsZero() = never
+   refreshCreds := func(ctx context.Context) (time.Time, error) {
        // These should all be mutually-exclusive configs.
        if *flUsername != "" {
            if err := git.StoreCredentials(ctx, *flUsername, *flPassword); err != nil {
-               return err
+               return lastCreds, err
            }
        }
        if *flAskPassURL != "" {
-           // When using an auth URL, the credentials can be dynamic, it needs to be
-           // re-fetched each time.
+           // When using an auth URL, the credentials can be dynamic, it needs
+           // to be re-fetched each time.  If the fetch fails, callers may try
+           // to overcome temporary failures by using previous credentials in
+           // the hope that they are still valid.
            if err := git.CallAskPassURL(ctx); err != nil {
                metricAskpassCount.WithLabelValues(metricKeyError).Inc()
-               return err
+               return lastCreds, err
            }
            metricAskpassCount.WithLabelValues(metricKeySuccess).Inc()
        }
-       return nil
+       lastCreds = time.Now()
+       return lastCreds, nil
    }

    failCount := 0
@@ -1720,10 +1724,15 @@ func (git *repoSync) currentWorktree() (worktree, error) {
 // SyncRepo syncs the repository to the desired ref, publishes it via the link,
 // and tries to clean up any detritus.  This function returns whether the
 // current hash has changed and what the new hash is.
-func (git *repoSync) SyncRepo(ctx context.Context, refreshCreds func(context.Context) error) (bool, string, error) {
+func (git *repoSync) SyncRepo(ctx context.Context, refreshCreds func(context.Context) (time.Time, error)) (bool, string, error) {
    git.log.V(3).Info("syncing", "repo", git.repo)

-   refreshCreds(ctx)
+   if lastCreds, err := refreshCreds(ctx); err != nil {
+       if lastCreds.IsZero() {
+           return false, "", fmt.Errorf("credential refresh failed: %w", err)
+       }
+       git.log.Error(err, "credential refresh failed but older credentials may work, attempting to sync", "lastCreds", lastCreds)
+   }

    // Initialize the repo directory if needed.
    if err := git.initRepo(ctx); err != nil {
diff --git a/test_e2e.sh b/test_e2e.sh
index 29646a1..14c7c90 100755
--- a/test_e2e.sh
+++ b/test_e2e.sh
@@ -1763,9 +1763,9 @@ function e2e::auth_askpass_url_correct_password() {
 }

 ##############################################
-# Test askpass-url where the URL is flaky
+# Test askpass-url where the URL is sometimes wrong
 ##############################################
-function e2e::auth_askpass_url_flaky() {
+function e2e::auth_askpass_url_sometimes_wrong() {
     # run with askpass_url service which alternates good/bad replies.
     HITLOG="$WORK/hitlog"
     cat /dev/null > "$HITLOG"
@@ -1777,7 +1777,7 @@ function e2e::auth_askpass_url_flaky() {
             echo
             if [ -f /tmp/flag ]; then
                 echo "username=my-username"
-               echo "password=my-password"
+                echo "password=my-password"
                 rm /tmp/flag
             else
                 echo "username=my-username"
@@ -1821,6 +1821,107 @@ function e2e::auth_askpass_url_flaky() {
     assert_file_eq "$ROOT/link/file" "$FUNCNAME 1"
 }

+##############################################
+# Test askpass-url where the URL is flaky
+##############################################
+function e2e::auth_askpass_url_flaky() {
+    # run with askpass_url service which alternates good/bad replies.
+    HITLOG="$WORK/hitlog"
+    cat /dev/null > "$HITLOG"
+    CTR=$(docker_run \
+        -v "$HITLOG":/var/log/hits \
+        e2e/test/ncsvr \
+        80 'read X
+            if [ -f /tmp/flag ]; then
+                echo "HTTP/1.1 200 OK"
+                echo
+                echo "username=my-username"
+                echo "password=my-password"
+                rm /tmp/flag
+            else
+                echo "HTTP/1.1 503 Service Unavailable"
+                echo
+                touch /tmp/flag
+            fi
+            ')
+    IP=$(docker_ip "$CTR")
+
+    # First sync
+    echo "$FUNCNAME 1" > "$REPO/file"
+    git -C "$REPO" commit -qam "$FUNCNAME 1"
+
+    GIT_SYNC \
+        --git="/$ASKPASS_GIT" \
+        --askpass-url="http://$IP/git_askpass" \
+        --max-failures=2 \
+        --period=100ms \
+        --repo="file://$REPO" \
+        --root="$ROOT" \
+        --link="link" \
+        &
+    wait_for_sync "${MAXWAIT}"
+    assert_link_exists "$ROOT/link"
+    assert_file_exists "$ROOT/link/file"
+    assert_file_eq "$ROOT/link/file" "$FUNCNAME 1"
+
+    # Move HEAD forward
+    echo "$FUNCNAME 2" > "$REPO/file"
+    git -C "$REPO" commit -qam "$FUNCNAME 2"
+    wait_for_sync "${MAXWAIT}"
+    assert_link_exists "$ROOT/link"
+    assert_file_exists "$ROOT/link/file"
+    assert_file_eq "$ROOT/link/file" "$FUNCNAME 2"
+
+    # Move HEAD backward
+    git -C "$REPO" reset -q --hard HEAD^
+    wait_for_sync "${MAXWAIT}"
+    assert_link_exists "$ROOT/link"
+    assert_file_exists "$ROOT/link/file"
+    assert_file_eq "$ROOT/link/file" "$FUNCNAME 1"
+}
+
+##############################################
+# Test askpass-url where the URL fails at startup
+##############################################
+function e2e::auth_askpass_url_slow_start() {
+    # run with askpass_url service which takes a while to come up
+    HITLOG="$WORK/hitlog"
+    cat /dev/null > "$HITLOG"
+    CTR=$(docker_run \
+        -v "$HITLOG":/var/log/hits \
+        --entrypoint sh \
+        e2e/test/ncsvr \
+        -c "sleep 4;
+            /ncsvr.sh 80 'read X
+                echo \"HTTP/1.1 200 OK\"
+                echo
+                echo \"username=my-username\"
+                echo \"password=my-password\"
+                '")
+    IP=$(docker_ip "$CTR")
+
+    # Sync
+    echo "$FUNCNAME" > "$REPO/file"
+    git -C "$REPO" commit -qam "$FUNCNAME"
+
+    GIT_SYNC \
+        --git="/$ASKPASS_GIT" \
+        --askpass-url="http://$IP/git_askpass" \
+        --max-failures=5 \
+        --period=1s \
+        --repo="file://$REPO" \
+        --root="$ROOT" \
+        --link="link" \
+        &
+    sleep 1
+    assert_file_absent "$ROOT/link"
+
+    wait_for_sync "5"
+    assert_link_exists "$ROOT/link"
+    assert_file_exists "$ROOT/link/file"
+    assert_file_eq "$ROOT/link/file" "$FUNCNAME"
+}
+
 ##############################################
 # Test exechook-success
 ##############################################

karlkfi commented 11 months ago

You can take it from here. I just wanted to get a quick fix so we could slam it into the upcoming config sync release with a code freeze next week, without having to fully test the upgrade to v4.

Honestly, I think I'm ok with it erroring immediately if the askpass call fails. In our case the askpass is a sidecar that itself hits the local metadata server, so the error rate is really low.

As long as the creds are still valid, reusing them while retrying is fine, but our askpass sidecar already handles caching and refresh before expiration, and the askpass response doesn't include the expiration. So it doesn't really make sense for git-sync to try to cache the result, because it doesn't know when it expires. Just ask askpass and let it handle the caching.

thockin commented 11 months ago

Git already "caches" the result because it is written to the gitconfig file, so in it's more "do I have any possibility of valid creds or no possibility?"

I'll port it back to v3 if you think it works.

karlkfi commented 11 months ago

The problem I see with this is that the cache never expires. So if it succeeds once and then fails forever, that failure will only ever be logged in the git-sync sidecar and not propagated to the user.

This could easily happen if IAM is changed to restrict access.

It's better than nothing, because support can find the root cause in the logs, but the user won't look there. So we'll still get the same number of support cases, they'll just be a little easier to solve.

Exposing the error propagates the error to the user which allows them to self-service a fix without contacting support.

thockin commented 11 months ago

that failure will only ever be logged in the git-sync sidecar and not propagated to the user.

If we have old, stale creds that don't work, my PR (#778) will: a) log "credential refresh failed but older credentials may work, attempting to sync" (should this go to the --error-file ?) b) fail to sync c) publish the errors to the --error-file (which was added for this purpose) d) increment the fail count, and exit when --max-failures is reached e) upon restart, no syncs will happen until creds are fetched once (logging "credential refresh failed" to --error-file)

How would you prefer it to be propagated?

We could add a flag for credential-refresh-period and stop trying to sync if we go beyond that, but I don't like to speculatively add flags and features. max-fails kind of covers that already. Thoughts?

karlkfi commented 11 months ago

Config Sync propagates errors from --error-file as "source errors" in the RootSync/RepoSync status. That status is read by the GUI/CLI. The log is mostly unseen by the user, because they usually don't know where to look, because each RSync has its own reconciler deployment, and each deployment has a half dozen containers.

The symptom today is that credential lookup fails for any number of reasons, but no error is exposed to the error file or even the log. Then the next git clone fails with a error about not being able to read the username or password from a file, which is an intermediate implementation detail that confuses Config Sync users.

So if you just log the credential refresh error and continue to write the git clone error to the error file, the Config Sync user will still get the same confusing error in the RSync source status. The reason I simple returned the refresh error was to improve the quality of the error so that the user could understand what the actual problem was and fix it themselves.

Another problem with #778 is that if the credential refresh is continually failing, but succeeded in the past, the Config Sync user won't see any error until the credentials expire. And even then the error will be that they're expired, not that the refresh failed.

It doesn't make sense to log errors when using an error file, unless the logged error is also in the error file. That would be an interface change that we would have to parse somehow, and parsing it would require some wrapper script or binary that writes to the shared disk... might as well just use the error file interface we already have.

kubernetes / git-sync

fix: handle errors from credential refresh #774