sourcegraph / sourcegraph-public-snapshot

Code AI platform with Code Search & Cody
https://sourcegraph.com
Other
10.1k stars 1.28k forks source link

Allow retrying of errored changesets #12700

Closed mrnugget closed 4 years ago

mrnugget commented 4 years ago

I'm not sure whether we handle retrying correctly. We need to check that retrying publishing works if gitserver fails, GitHub is down, etc. And we need to check that the update works.

We should also probably allow retrying by simply re-applying the same campaign spec.

mrnugget commented 4 years ago

@sourcegraph/campaigns after looking into this, I think https://github.com/sourcegraph/sourcegraph/pull/12905 is as much as we can do for now.

My goal for this ticket was to make sure that users are never stuck with a failed changesets and #12905 solves that.

So:

The current state of retrying failed changesets

Users can retry the publishing/updating of changesets by applying a new CampaignSpec with new ChangesetSpecs. That means the src-cli could simply create new changesetSpecs/campaignSpecs and apply them every time the user runs src campaign apply. Users can then just re-run the command to retry.

Ideas on the future of retrying failed changesets

Idea 1: Retry on re-apply

If we end up introducing caching in src-cli (which we should at some point, because creating new changesetSpecs/campaignSpecs with the same diff is wasteful) we could say that the applyCampaign mutation receives a new side-effect: if you re-apply the same campaign spec to the same campaign, we re-reenqueue all failed changesets. That's easy enough to do by changing this

https://github.com/sourcegraph/sourcegraph/blob/c6ecbcc6a7c6435ee8c30ea0e5f2f67d2e143849/enterprise/internal/campaigns/service.go#L277-L279

into something like this:

    if campaign.CampaignSpecID == campaignSpec.ID {
        err := s.store.EnqueueChangesets(ctx, EnqueueChangesetOpts{
            CampaignSpecID: campaignSpec.ID,
            ReconcilerState: campaigns.ReconcilerStateErrored,
        })
        return campaign, err
    }

(That method, EnqueueChangesets, doesn't exist yet, but would go well with the two-column-enqueueing-approach outlined in https://github.com/sourcegraph/sourcegraph/issues/12827)

Idea 2: Automatic retry by reconciler

The reconciler could simply pick up changesets where reconciler_state = 'errored' every 60s and retry them (a) a fixed number of times (b) forever until they're completed.

There's a few twists that we can make here:

  1. Retrying uses exponential backoff (it doesn't make sense to keep on hammering the code host if we're rate limited)
  2. Retrying is only done for non-ephemeral errors (this is sketched out here: https://github.com/sourcegraph/sourcegraph/issues/12518).

All of this would require changes to the underlying workerutil package.

I think we should approach this by going from the easiest to the most sophisticated approach:

  1. Retry every X seconds
  2. Retry with exponential backoff
  3. Retry ephemeral errors

The last one is hard, because it's hard to distinguish between ephemeral and non-ephemeral errors, especiall across service boundaries (i.e. we know that a call to GitHub with a 5xx response code is retryable, but what if gitserver responsd with an error? Is gitserver running into it because GitHub is down, or is it because the patch has an invalid format?).


I think both of these ideas should be tackled in the next iteration, especially since the first one goes hand in hand with https://github.com/sourcegraph/sourcegraph/issues/12827 and the easiest version of the second one should be ™️ relatively easy to implement.

marekweb commented 4 years ago

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.19 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly. When in doubt, reach out!

Thank you

mrnugget commented 4 years ago

Just an addition to my previous comment : we also need to make sure that the retrying of importing a changeset works. Right now, if it fails, you can apply a new campaign spec but it doesn't get re-tried, since imported changesets don't have a changeset spec.

mrnugget commented 4 years ago

Closing this because #13457 and #13478 add automatic interval-based retrying to the reconciler and that solves 90% of the problems that this issue aim to address.

What we can/will do later:

What doesn't make sense:

mrnugget commented 4 years ago

Right after writing the previous comment I realised that I missed something. Here it is: https://github.com/sourcegraph/sourcegraph/pull/13591/files