And if we keep the behavior, is 1 a good default number of retries, or should it perhaps be even higher?
My mwapi crate in Rust (very similar goals to this project!) defaults to 10 retries, but it only retries on `maxlag` and `readonly` errors rather than on any request with a `Retry-After` header (see https://gitlab.com/mwbot-rs/mwbot/-/blob/master/mwapi/src/client.rs#L356 and https://gitlab.com/mwbot-rs/mwbot/-/blob/master/mwapi_errors/src/lib.rs#L197).
I think that results in safer / more predictable behavior at the cost of not being totally complete, e.g. it doesn't handle HTTP 429s from Varnish. That's semi-intentional because MediaWiki itself doesn't emit 429s so I'm not (yet) convinced that a "MediaWiki API" wrapper should even worry about it.
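For comparison, the distinction could look something like this sketch (not code from either library; it assumes the newer `errorformat` responses, where errors arrive as an `errors` array with `code` members):

```js
// Sketch only: decide whether to retry based on known error codes in the API
// response body, instead of reacting to any Retry-After header.
const RETRYABLE_CODES = new Set( [ 'maxlag', 'readonly' ] );

function shouldRetry( apiResponse ) {
	return ( apiResponse.errors ?? [] )
		.some( ( error ) => RETRYABLE_CODES.has( error.code ) );
}
```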
Hm, that’s a fair point. I quickly looked back at my QuickCategories code (the tool of mine with the best error handling ^^) and the only retry-later errors I handle there are also maxlag and readonly.
Although, if it’s already specific to certain errors – would it make sense to wait for a bit longer after a readonly error? (If it’s a replication lag auto-readonly, then it’s kind of a more extreme maxlag condition – IIRC the threshold for that is 60 seconds, compared to the usual maxlag=5.)
I think I’m leaning towards:

- respect the `Retry-After` header if it’s present
- assume a default `Retry-After` for `maxlag` errors even if the API doesn’t send the header (misconfigured proxy that drops the header? idk): 5 seconds?
- assume a default `Retry-After` for `readonly` errors, where the API currently doesn’t send the header: 30 seconds?
- change (possibly rename) the `maxRetries` option to limit the duration of the overall retry, not the number of retries, since the duration now varies depending on error code – for an interactive application, a single `readonly` retry might already be too long to wait for
- since we retry `readonly` errors by default, the default max retries should be at least as long as one `readonly` retry, and probably a bit more – perhaps 60 or 65 seconds? which has the added effect of automatically retrying `maxlag` errors several more times
- the `Retry-After`s for both error codes should be configurable in the request options (but not in the form of “you can react with `Retry-After` to any error code” – it should be limited to `maxlag`+`readonly`, I think)

The reason I’m thinking about this again is that #6 requires another kind of automatic retry based on error codes rather than headers (`badtoken` error), though in that case there’ll be no delay. This also suggests that the new max retries limit should be based on elapsed wall-clock time (including the time the request itself takes), not just on adding up the `Retry-After` delay – otherwise, `badtoken` errors could be retried indefinitely, since they don’t accumulate a delay. (Alternatively, still keep a maximum number of retries, just make the number so high that it’s only a last resort but usually doesn’t apply. But I think limiting retry based on elapsed wall-clock time makes more sense.)
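A minimal sketch of those defaults (names invented here, not the actual m3api implementation; header access assumes the standard fetch `Response` object):

```js
// Sketch only: prefer the server's Retry-After header when it is present,
// otherwise fall back to a per-error-code default delay; error codes without
// a default are treated as non-retryable.
const DEFAULT_RETRY_AFTER_SECONDS = {
	maxlag: 5,
	readonly: 30,
};

function retryDelaySeconds( response, errorCode ) {
	const header = response.headers.get( 'retry-after' );
	if ( header !== null ) {
		return parseInt( header, 10 );
	}
	// undefined means: don't retry this error code
	return DEFAULT_RETRY_AFTER_SECONDS[ errorCode ];
}
```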
For “elapsed wall-clock time”, probably use `performance.now()`, which seems to be a widely available monotonic clock. (`process.hrtime.bigint()` is Node-specific; actual wall-clock time is not necessarily monotonic, even if we use UTC instead of local time.)
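For illustration, a hypothetical retry loop limited by elapsed time rather than by a retry count (`doRequest` and its return shape are invented for this sketch):

```js
// Sketch only: retry until the time budget is spent; because the clock keeps
// running during the requests themselves, even zero-delay retries (badtoken)
// eventually hit the limit.
async function requestWithRetries( doRequest, maxRetriesSeconds = 65 ) {
	const deadline = performance.now() + maxRetriesSeconds * 1000;
	for ( ;; ) {
		const { response, retryDelaySeconds } = await doRequest();
		if ( retryDelaySeconds === undefined ||
			performance.now() + retryDelaySeconds * 1000 > deadline ) {
			return response;
		}
		await new Promise( ( resolve ) => {
			setTimeout( resolve, retryDelaySeconds * 1000 );
		} );
	}
}
```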
> change (possibly rename) the `maxRetries` option to limit the duration of the overall retry, not the number of retries, since the duration now varies depending on error code…

I’m going with `maxRetriesSeconds` for now, though I could probably be convinced to use something else. I think “retries” instead of “retry” is good to emphasize that we may retry more than once, and “seconds” clarifies the unit – but I could also see e.g. `retriesLimitSeconds` or something like that.
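As a usage sketch (assuming m3api’s `Session` / `request( params, options )` shape and made-up parameter values; details may differ), an interactive caller could then lower the budget per request:

```js
import Session from 'm3api/node.js';

const session = new Session( 'en.wikipedia.org', {
	formatversion: 2,
}, {
	userAgent: 'retry-example (mail@example.com)', // made-up contact info
} );

// for an interactive use case, give up on retries after ~10 seconds
const response = await session.request( {
	action: 'query',
	meta: 'siteinfo',
}, {
	maxRetriesSeconds: 10,
} );
```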
Dammit, CI failed for these changes – because, while Node has a `performance` object since version 8, it’s only available globally since version 16. Before then, you have to import it from the `node:perf_hooks` module.
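A possible workaround (a sketch, not necessarily what the actual fix looks like): use the global `performance` where it exists and fall back to importing it from `perf_hooks` otherwise.

```js
// Sketch only: globalThis.performance exists in browsers and Node >= 16;
// on older Node, load it from the built-in perf_hooks module instead.
// The bare 'perf_hooks' specifier is used because the node: prefix is itself
// a relatively recent addition; top-level await still needs Node >= 14.8.
const { performance } = globalThis.performance ?
	globalThis :
	await import( 'perf_hooks' );

const start = performance.now();
```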
6e9b954898e9af37c7cd5de8ea2fd7d1c19f12d4 (first released in 0.2.0) makes m3api automatically retry API requests when receiving a `Retry-After` response header (i.e., if the request specified a `maxlag` parameter and database replication lag is currently above that). I’m not sure yet whether this is a good thing to do by default or not, and that decision should be made before the 1.0.0 release. If anyone has opinions on that, please leave them here!
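Roughly, the behavior in question is something like this sketch (simplified; it only handles the delay-in-seconds form of `Retry-After`, and the real code lives in m3api itself):

```js
// Sketch only: if the response carries a Retry-After header, wait that long
// and send the same request again, up to maxRetries times.
async function fetchWithRetryAfter( url, options, maxRetries = 1 ) {
	let response = await fetch( url, options );
	while ( maxRetries-- > 0 && response.headers.has( 'retry-after' ) ) {
		const seconds = parseInt( response.headers.get( 'retry-after' ), 10 );
		await new Promise( ( resolve ) => {
			setTimeout( resolve, seconds * 1000 );
		} );
		response = await fetch( url, options );
	}
	return response;
}
```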