Great catch! Can you fill me in with more details (either here or via Slack) about the timeout condition you saw? What was the exact error?
Background: We want our client to be resilient to transient errors -- i.e., a network hiccup, rate limiting, or a request that for whatever reason crashes on the server. Essentially: if a client makes a request that fails, and re-requesting it has a reasonable chance of working (and not making things worse), then doing that automatically makes consumers of the client more reliable. This is especially important for automated cron jobs that use the client.
Issue: The current solution for this is to issue retries in certain cases. As you point out, if the first call to the API triggers a non-idempotent mutation, or if the query has two mutations where one succeeds and the second fails, then the retry logic puts the server into a potentially "bad" state that can only be recovered by the user making a manual determination.
We trigger retries in several cases:
429: Rate limited -- in this case, the query was never sent to the worker process; retry is safe.
502/503: HTTP error -- backend server failure; if there were two mutations sent in one transaction, one could have worked and a second failed.
RemoteDisconnected: Not actually clear when this happens; would need to investigate more.
Timeout: we call urllib.request.urlopen() -- we don't currently call sock.settimeout() for responses (although maybe we should add this as a kwarg to run_query() to allow users to specify some max time; unrelated to this issue). [NB: When called via the command line, the main() method does impose a 5-minute max run time; that's to prevent really bad queries from making it into routine usage, and to protect against stuck cron jobs in the case where the CLI tool is used and somehow gets stuck.]
Potential solutions:
Option 1: We could eliminate retry logic by default. This would avoid the condition, but would likely result in most users never enabling it (defaults are powerful things), essentially eliminating the benefit of stepping over rate limiting (by delaying the retry so a second attempt can succeed) or of recovering from a potentially-crashed request. (API servers shouldn't crash ... but Fargate memory limits mean our server worker processes still occasionally get killed via OOMs.)
Option 2: We could try to determine if the query is safe to replay: i.e. the query has no mutations; or the error we get from the server is clearly safe-to-replay (e.g. 429 rate-limited); etc.
I'd prefer option 2, but I don't know under what exact conditions this occurred. One thing that would help: a custom GQL query that accepts an int value for milliseconds to sleep, so we can test what's happening here with timeouts.
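On the settimeout point above, here's a minimal sketch of what a max-time kwarg on run_query() could look like, assuming the client posts JSON via urllib. The URL, payload shape, and function body here are illustrative, not our actual client code:

```python
import json
import socket
import urllib.request

def run_query(query, url="https://example.com/graphql", timeout=None):
    # Hypothetical sketch: the real client's request format may differ.
    req = urllib.request.Request(
        url,
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    try:
        # urlopen's `timeout` kwarg sets the socket timeout for the request,
        # so callers can bound how long they wait for a response.
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode("utf-8"))
    except socket.timeout:
        # A timed-out request may still have executed server-side, so the
        # caller must not assume it's safe to blindly re-send.
        raise
```

Default of timeout=None preserves today's behavior (block indefinitely), so adding the kwarg would be backward compatible.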
Per convo with @this-vidor -- going with option 2 ... because it's easy to determine whether our queries are mutations. If the first word of the query isn't "mutation", then it's not a mutation ... so we'll do retry logic only on non-mutation queries. PR coming up shortly.
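The first-word check could be as simple as the following sketch (the helper names are mine, not necessarily what the PR will use):

```python
def is_mutation(query: str) -> bool:
    """Return True if a GraphQL operation is a mutation.

    Per the GraphQL grammar, a mutation operation must begin with the
    keyword "mutation"; read-only operations begin with "query" or
    the shorthand "{".
    """
    first_word = query.split(None, 1)[0] if query.strip() else ""
    return first_word.lower().startswith("mutation")

def should_retry(query: str, status: int) -> bool:
    """Retry only when replay is safe per the cases listed above."""
    if status == 429:
        # Rate limited: the query never reached a worker process.
        return True
    if is_mutation(query):
        # Replaying could re-apply a partial or already-committed mutation.
        return False
    # Backend failure (502/503) on a read-only query is safe to replay.
    return status in (502, 503)
```

One caveat: queries using the anonymous shorthand form (starting with "{") are always reads, so the first-word check stays conservative in the right direction -- it only ever declines to retry, never retries a mutation.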
Timeout does not mean that the mutation has not changed data on the server. Retrying a mutation on timeout can put the server into an unintended state. This should not be the default behavior!