jebena / jebena-python-client

Simple Python Client for the Jebena API Server
Mozilla Public License 2.0

Retry-on-timeout behavior can create a mess for non-idempotent mutations. #2

Closed this-vidor closed 3 years ago

this-vidor commented 3 years ago

A timeout does not mean that the mutation has not changed data on the server. Retrying the mutation on timeout can put the server into an unintended state. This should not be the default behavior!

jpotter commented 3 years ago

Great catch! Can you fill me in with more details (either here or via Slack) about the timeout condition you saw? What was the exact error?

Background: We want our client to be resilient to transient errors -- i.e. a network hiccup, rate limiting, or a request that crashes on the server for whatever reason. Essentially: if a client makes a request that fails, and re-requesting it has a reasonable chance of working (and not making things worse), then retrying automatically makes consumers of the client more reliable. This is especially important for automated cron jobs that use the client.
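For reference, the retry behavior being discussed is roughly the following pattern (a minimal sketch, not the client's actual code; the function name, parameters, and use of `requests` are illustrative assumptions):

```python
import time

import requests  # assumed HTTP layer for this sketch; the real client may differ


def post_with_retries(url, payload, attempts=3, backoff_seconds=2.0, timeout=30):
    """Re-issue a failed request a few times, with a delay between attempts.

    This is the behavior the issue describes: every failure is retried,
    even if the first attempt may have already executed a mutation server-side.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            response = requests.post(url, json=payload, timeout=timeout)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            last_error = exc
            time.sleep(backoff_seconds * (attempt + 1))  # simple linear backoff
    raise last_error
```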

Issue: The current solution for this is to issue retries in certain cases. As you point out, if the first call to the API triggers a non-idempotent mutation, or if the query has two mutations where one succeeds and the second fails, then the retry logic puts the server into a potentially "bad" state that can only be recovered by manual intervention from the user.

We trigger retries in several cases:

Potential solutions:

Option 1: We could eliminate retry logic by default. This would avoid the condition, but would likely result in most users not enabling it (defaults are powerful things), essentially eliminating the benefit of stepping over rate limiting (by delaying the retry so that the second attempt can succeed) or of handling a potentially-crashed request. (API servers shouldn't crash ... but Fargate memory limits have continued to cause our server worker processes to occasionally get killed via OOMs.)

Option 2: We could try to determine if the query is safe to replay: i.e. the query has no mutations; or the error we get from the server is clearly safe-to-replay (e.g. 429 rate-limited); etc.
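The error-classification half of that check could look roughly like this (a sketch only; the status-code set and names are assumptions, not existing client behavior):

```python
from typing import Optional

# Responses that clearly never reached a resolver, so replaying them cannot
# duplicate a mutation. Assumed set -- 429 (rate limited) is the obvious case.
CLEARLY_SAFE_STATUS_CODES = {429}


def error_is_safe_to_replay(status_code: Optional[int]) -> bool:
    """True when the failed response is known not to have executed the query."""
    return status_code in CLEARLY_SAFE_STATUS_CODES
```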

I'd prefer option 2, but I don't know under what exact conditions this occurred. One thing that would help: a custom GQL query that accepts an int value for milliseconds to sleep, so we can test what's happening here with timeouts.
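For that testing hook, the query could be something along these lines (purely hypothetical; the `debugSleep` field and its argument do not exist in the schema yet):

```python
# Hypothetical GraphQL document for probing timeout behavior. The field name
# and argument are placeholders for a server-side sleep to be added later.
TIMEOUT_PROBE_QUERY = """
query TimeoutProbe {
    debugSleep(milliseconds: 35000)
}
"""

# Run it with a client-side timeout shorter than the sleep to see exactly which
# exception surfaces and whether the retry path kicks in.
```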

jpotter commented 3 years ago

Per convo with @this-vidor -- going with option 2 ... because it's easy to determine if our queries are mutations. If the first word of the query isn't "mutation", then it's not a mutation ... so we'll do retry logic on non-mutation queries. PR coming up shortly. A sketch of that first-word check is below (the helper name is illustrative):
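```python
def query_is_mutation(query: str) -> bool:
    """True when the first word of the GraphQL document is 'mutation'."""
    words = query.split()
    return bool(words) and words[0].lower() == "mutation"


# Retry logic would only apply when query_is_mutation(...) is False:
assert query_is_mutation("mutation UpdateThing { ... }")
assert not query_is_mutation("query GetThing { ... }")
```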