And if we keep the behavior, is 1 a good default number of retries, or should it perhaps be even higher?
My mwapi crate in Rust (very similar goals to this project!) defaults to 10 retries, but it only retries on `maxlag` and `readonly` errors rather than on any request with a `Retry-After` header (see https://gitlab.com/mwbot-rs/mwbot/-/blob/master/mwapi/src/client.rs#L356 and https://gitlab.com/mwbot-rs/mwbot/-/blob/master/mwapi_errors/src/lib.rs#L197).
I think that results in safer / more predictable behavior at the cost of not being totally complete, e.g. it doesn't handle HTTP 429s from Varnish. That's semi-intentional because MediaWiki itself doesn't emit 429s so I'm not (yet) convinced that a "MediaWiki API" wrapper should even worry about it.
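For comparison, the distinction could look something like this sketch (not code from either library; it assumes the newer `errorformat` responses, where errors arrive as an `errors` array with `code` members):

```js
// Sketch only: decide whether to retry based on known error codes in the API
// response body, instead of reacting to any Retry-After header.
const RETRYABLE_CODES = new Set( [ 'maxlag', 'readonly' ] );

function shouldRetry( apiResponse ) {
	return ( apiResponse.errors ?? [] )
		.some( ( error ) => RETRYABLE_CODES.has( error.code ) );
}
```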
Hm, that’s a fair point. I quickly looked back at my QuickCategories code (the tool of mine with the best error handling ^^) and the only retry-later errors I handle there are also maxlag and readonly.
Although, if it’s already specific to certain errors – would it make sense to wait for a bit longer after a readonly error? (If it’s a replication lag auto-readonly, then it’s kind of a more extreme maxlag condition – IIRC the threshold for that is 60 seconds, compared to the usual maxlag=5.)
I think I’m leaning towards:

- respect the `Retry-After` header if it’s present
- assume a default `Retry-After` for `maxlag` errors even if the API doesn’t send the header (misconfigured proxy that drops the header? idk): 5 seconds?
- assume a default `Retry-After` for `readonly` errors, where the API currently doesn’t send the header: 30 seconds?
- change (possibly rename) the `maxRetries` option to limit the duration of the overall retry, not the number of retries, since the duration now varies depending on error code – for an interactive application, a single `readonly` retry might already be too long to wait for
- since we retry `readonly` errors by default, the default max retries should be at least as long as one `readonly` retry, and probably a bit more – perhaps 60 or 65 seconds? which has the added effect of automatically retrying `maxlag` errors several more times
- the `Retry-After`s for both error codes should be configurable in the request options (but not in the form of “you can react with `Retry-After` to any error code” – it should be limited to `maxlag`+`readonly`, I think)

The reason I’m thinking about this again is that #6 requires another kind of automatic retry based on error codes rather than headers (`badtoken` error), though in that case there’ll be no delay. This also suggests that the new max retries limit should be based on elapsed wall-clock time (including the time the request itself takes), not just on adding up the `Retry-After` delay – otherwise, `badtoken` errors could be retried indefinitely, since they don’t accumulate a delay. (Alternatively, still keep a maximum number of retries, just make the number so high that it’s only a last resort but usually doesn’t apply. But I think limiting retry based on elapsed wall-clock time makes more sense.)
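A minimal sketch of those defaults (names invented here, not the actual m3api implementation; header access assumes the standard fetch `Response` object):

```js
// Sketch only: prefer the server's Retry-After header when it is present,
// otherwise fall back to a per-error-code default delay; error codes without
// a default are treated as non-retryable.
const DEFAULT_RETRY_AFTER_SECONDS = {
	maxlag: 5,
	readonly: 30,
};

function retryDelaySeconds( response, errorCode ) {
	const header = response.headers.get( 'retry-after' );
	if ( header !== null ) {
		return parseInt( header, 10 );
	}
	// undefined means: don't retry this error code
	return DEFAULT_RETRY_AFTER_SECONDS[ errorCode ];
}
```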
For “elapsed wall-clock time”, probably use `performance.now()`, which seems to be a widely available monotonic clock. (`process.hrtime.bigint()` is Node-specific; actual wall-clock time is not necessarily monotonic, even if we use UTC instead of local time.)
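For illustration, a hypothetical retry loop limited by elapsed time rather than by a retry count (`doRequest` and its return shape are invented for this sketch):

```js
// Sketch only: retry until the time budget is spent; because the clock keeps
// running during the requests themselves, even zero-delay retries (badtoken)
// eventually hit the limit.
async function requestWithRetries( doRequest, maxRetriesSeconds = 65 ) {
	const deadline = performance.now() + maxRetriesSeconds * 1000;
	for ( ;; ) {
		const { response, retryDelaySeconds } = await doRequest();
		if ( retryDelaySeconds === undefined ||
			performance.now() + retryDelaySeconds * 1000 > deadline ) {
			return response;
		}
		await new Promise( ( resolve ) => {
			setTimeout( resolve, retryDelaySeconds * 1000 );
		} );
	}
}
```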
> change (possibly rename) the `maxRetries` option to limit the duration of the overall retry, not the number of retries, since the duration now varies depending on error code…

I’m going with `maxRetriesSeconds` for now, though I could probably be convinced to use something else. I think “retries” instead of “retry” is good to emphasize that we may retry more than once, and “seconds” clarifies the unit – but I could also see e.g. `retriesLimitSeconds` or something like that.
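As a usage sketch (assuming m3api’s `Session` / `request( params, options )` shape and made-up parameter values; details may differ), an interactive caller could then lower the budget per request:

```js
import Session from 'm3api/node.js';

const session = new Session( 'en.wikipedia.org', {
	formatversion: 2,
}, {
	userAgent: 'retry-example (mail@example.com)', // made-up contact info
} );

// for an interactive use case, give up on retries after ~10 seconds
const response = await session.request( {
	action: 'query',
	meta: 'siteinfo',
}, {
	maxRetriesSeconds: 10,
} );
```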
Dammit, CI failed for these changes – because, while Node has a `performance` object since version 8, it’s only available globally since version 16. Before then, you have to import it from the `node:perf_hooks` module.
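A possible workaround (a sketch, not necessarily what the actual fix looks like): use the global `performance` where it exists and fall back to importing it from `perf_hooks` otherwise.

```js
// Sketch only: globalThis.performance exists in browsers and Node >= 16;
// on older Node, load it from the built-in perf_hooks module instead.
// The bare 'perf_hooks' specifier is used because the node: prefix is itself
// a relatively recent addition; top-level await still needs Node >= 14.8.
const { performance } = globalThis.performance ?
	globalThis :
	await import( 'perf_hooks' );

const start = performance.now();
```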
6e9b954898e9af37c7cd5de8ea2fd7d1c19f12d4 (first released in 0.2.0) makes m3api automatically retry API requests when receiving a `Retry-After` response header (i.e., if the request specified a `maxlag` parameter and database replication lag is currently above that). I’m not sure yet whether this is a good thing to do by default or not, and that decision should be made before the 1.0.0 release. If anyone has opinions on that, please leave them here!
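Roughly, the behavior in question is something like this sketch (simplified; it only handles the delay-in-seconds form of `Retry-After`, and the real code lives in m3api itself):

```js
// Sketch only: if the response carries a Retry-After header, wait that long
// and send the same request again, up to maxRetries times.
async function fetchWithRetryAfter( url, options, maxRetries = 1 ) {
	let response = await fetch( url, options );
	while ( maxRetries-- > 0 && response.headers.has( 'retry-after' ) ) {
		const seconds = parseInt( response.headers.get( 'retry-after' ), 10 );
		await new Promise( ( resolve ) => {
			setTimeout( resolve, seconds * 1000 );
		} );
		response = await fetch( url, options );
	}
	return response;
}
```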