github / gh-gei

Migration CLI for GitHub to GitHub migrations
MIT License
328 stars 89 forks source link

Retry GraphQL mutations on `SERVICE_UNAVAILABLE` errors #707

Open tambling opened 2 years ago

tambling commented 2 years ago

We now have logic to retry all GraphQL reads in the case of a non-successful response, but there are still two issues (https://github.com/github/migration-friction/issues/689 and https://github.com/github/migration-friction/issues/701) that concern the startRepositoryMigration mutation. Since @synthead made the necessary change to the monolith to return SERVICE_UNAVAILABLE for mutations that have errored but are safe to retry, we should look for that status in the CLI to extend our retry to mutations (specifically startRepositoryMigration, which sends the SERVICE_UNAVAILABLE error).


Original comment from @synthead on #666 follows:

https://github.com/github/github/pull/241010 just shipped to prod! This change makes the Octoshift GraphQL APIs return "type": "SERVICE_UNAVAILABLE" in the error payload for failed requests. We can consume this error message when solving for this issue to do informed retries so we're not just running blind :+1:

I made sure in the PR that we only return these errors for requests that can be safely retried. This includes:

  • GraphQL queries that could not make a connection to Octoshift
  • GraphQL queries that made a connection to Octoshift, but had a dropped connection while transferring data
  • GraphQL mutations that could not make a connection to Octoshift

Note that I opted to still return a "something went wrong" failure for mutations that had their connections dropped partway because we cannot be confident that server-side transactions haven't been made.

dylan-smith commented 2 years ago

are we sure that it's always safe to retry when we get SERVICE_UNAVAILABLE? What if the migration was halfway through, then got a SERVICE_UNAVAILABLE. Does the migration clean up after itself before returning this error?

Edit: Nevermind, I see my questions were all answered in @synthead PR