sindresorhus / ky

🌳 Tiny & elegant JavaScript HTTP client based on the Fetch API
MIT License
13.64k stars 363 forks source link

Support retries on certain network failures #185

Closed spruce-bruce closed 4 years ago

spruce-bruce commented 4 years ago

In poor network conditions network requests may fail in weird ways. Ky supports retrying on predictable errors (ie, based on response codes), could you support retrying on predictable network failures?

I've recently done a lot of work cataloguing the types of network failures I've seen in my production native apps using fetch, and there are two main failures that could be incorporated into your retry logic:

  1. fetch() throws "Network request failed" (occurs when a request is dropped, but also when a domain is unreachable like google.comm)
  2. response is returned without a body (stream interrupted during response i think)

It will be difficult to support these without muddying your api, though, because it is not obvious that these should always be retried. You'd have to allow users to provide some kind of a callback that allows them to make a decision based on the request.

sholladay commented 4 years ago

The second one sounds problematic to implement. But the first one sounds relatively simple. Can you explain more why you think a callback would be necessary? I'm not sure I follow what a Ky user would do to filter which network errors result in a retry.

spruce-bruce commented 4 years ago

Sure - the short answer is that when fetch throws Network Request Failed it's possible that the server received and processed the request, still. So I think Ky users would need a way to say (for a non idempotent POST request, for example) "don't retry this request because that could result in creating a new duplicate resource".

This is the same problem as number 2, the request has completed, but a user with non-idempotent requests will want to prevent retries sometimes. This could be a async callback, but if the user has to make a network request to know whether or not to retry a network request then there could be really hard to resolve cascading failures, so a non-async "yes or no" makes the most sense to me.

Frankly I'm not necessarily convinced this is something that Ky should be doing, because users who care about supporting poor networks will likely need their own retry mechanisms anyway, but I wanted to open up the discussion with you and at the very least inform the community about the types of failures that may warrant a retry that Ky may choose not to handle.

Edit: And you're right, the second would be problematic. It's hard to detect this kind of failure because of course an empty body is sometimes valid. I don't know how to implement without a second callback for deciding how to detect this failure, since the user will be the only one who can tell you whether or not an empty body is expected for any given request.

sholladay commented 4 years ago

The server is in a much better position to know whether an incoming request should be accepted than the client is to know whether an outgoing retry request should be performed. Any solution within or on top of Ky is inherently going to be imperfect when an explicit status code is missing.

I think the best we can do here is give beforeRetry hooks the ability to prevent a retry from happening (perhaps by returning a symbol). Then you could, for example, call your API to see if your POST request did actually make it through.

spruce-bruce commented 4 years ago

Yes I agree - I think we're talking about the same thing - the beforeRetry hook would be the synchronous callback that I'm referring to. Hook is a much clearer word for it, of course.

Ky can't possibly be responsible for knowing if it should be retrying in these scenarios so you must give the user control of that decision.

sholladay commented 4 years ago

@spruce-bruce do you feel that we can close this issue, given that PR https://github.com/sindresorhus/ky/pull/198 was released?

We haven't made any changes regarding which errors are automatically retried by Ky, but you can now return ky.stop in the beforeRetry hook to prevent a retry. So I think your use case is fully supported because you can filter out certain responses. Is that enough for you? Do you think there are any further improvements we could make?

spruce-bruce commented 4 years ago

@sholladay awesome! I just took a look, here's my understanding of how that PR will effect the two "retryable" failure cases that I'm thinking about:

(async () => { const parsed = await ky('https://example.com', { retry: { limit: 10, methods: ['get'], statusCodes: [200] }, hooks: { beforeRetry: [ async ({request, response, options, errors, retryCount}) => { if (/ body is not empty for a request that i know it should be present for /) / dont retry / else / retry / } ] } }).json(); })();



The semantics of that are a little weird, having to instruct ky to always retry successful requests in order to take control of filtering out the known successes and then retry the false positives, but in terms of the fundamentals of solving this problem that is correct. If a user really wants to retry those requests they must inspect every success and validate that it really did succeed, because only the application itself can know if a 200 response with an empty body is really erroneous.

I'd like to take the opportunity to reiterate that it may not be the responsibility of ky to own this problem, so I think it'd be fairly reasonable to close this issue and declare that "ky is not taking responsibility for this level of network uncertainty".
sholladay commented 4 years ago

Fetch throws an error when the request is dropped

Ky should already be retrying network failures like those. Are you saying that you ran into cases where Ky did not retry such errors?

Fetch responds with a 200 but the body is empty

As you said, logic that decides whether a 200 with an empty body is an error definitely belongs at the application layer. Frankly, I would recommend not doing this. My opinion is that it's not the client's responsibility to "second guess" the server's decision to return a 200. Even if you want to do so, and the client determines that the 200 is actually an error, I'm still skeptical that a retry is warranted. It's most likely a programmer error, not an operational error, and thus it probably can't be solved with a retry.

spruce-bruce commented 4 years ago

Ky should already be retrying network failures like those. Are you saying that you ran into cases where Ky did not retry such errors?

Ah, my mistake, I had to reorient myself to Ky in order to try and answer your question, and I misremembered this point.

It's most likely a programmer error, not an operational error, and thus it probably can't be solved with a retry.

It's certainly possible that my results can't be replicated and that I'm wrong here, but I've been able to track down this kind of error in production in a mobile app using fetch. It seems to occur when the server is streaming the response to the client and the connection is interrupted. The body can be left empty while the response headers are left intact, and fetch (or maybe just the fetch library included in React Native) will return the response with an empty body.

I'm no network technician, and of course my testing is probably flawed in a handful of ways, but this erroneous empty body problem seems real to me, and it is solved by a retry on GET requests, but not on PUT/POST/DELETE requests.

sholladay commented 4 years ago

Ah, that's interesting. Yes, when using streaming (i.e. transfer-encoding: chunked), I can see how fetch() could possibly mistakenly return a 200 with an empty body if the response was abruptly cut off at exactly the right moment. I believe it should be possible for fetch() to detect this scenario and throw an error instead of returning a Response, but I'm not going to vouch for the correctness of all fetch() implementations. If this is happening to you, my suggestion would be to first inspect the network environment. For example, Heroku imposes a time limit on how long your server can take to respond. There could be something like that going on. I can also imagine that mobile apps would be more prone to this with users switching between cell towers, etc. I would say it's worth investigating a bit deeper before adding retries for 200s to your app. But I see why it could be useful as a quick workaround.

Okay, given that there don't seem to be any other action items for Ky internals here, I'm going to close this. If you come up with any other ideas for how to improve our retry system, definitely let me know. :)