reconciliation-api / specs

Specifications of the reconciliation API
https://reconciliation-api.github.io/specs/draft/

Suggest query batch sizes and HTTP concurrency #66

Closed wetneb closed 3 weeks ago

wetneb commented 3 years ago

Reconciliation clients can submit query batches, but there is no clear indication of how big these batches should be, nor of how many parallel HTTP requests should be fired at a given service.

The specs could be more precise on this, perhaps by letting services announce in their manifests some maximum or recommended values for these settings.

Corresponding OpenRefine issues: https://github.com/OpenRefine/OpenRefine/issues/2549, https://github.com/OpenRefine/OpenRefine/issues/3657.

tfmorris commented 3 years ago

"Concurrency" is just, in my opinion, a special case of rate limiting, which is already covered by #40 (and #43).

I'm not sure anything service-level should go in the manifest, since it may vary by the user's service class, be load-related, or be specified out of band.

If there's a maximum batch size, I guess I could see adding that to the manifest (although it's also affected by a variety of HTTP parameters), but would it affect any clients' behavior? They are more likely to choose a batch size in the sweet spot that matches their desired latency, reliability, etc., which is probably much smaller than any reasonable maximum.

More generally, perhaps we should try to standardize some core functionality, get some experience with it, and then see what's high priority to add.

gitonthescene commented 3 years ago

Concurrency is about parallelism and rate limiting is not. I agree that the service doesn't have much control over what a client may attempt, and that rate limiting is one way to handle a poorly behaving client. On the other hand, the service manifest is one way the server can declare what it believes is acceptable from a client.

Different services may have different levels of tolerance for load and so the client doesn't need to treat all servers alike. Currently, OpenRefine sends reconciliation requests with a maximum batchSize of 10 regardless of the server. Similarly, OpenRefine sends only one request at a time regardless of how long it takes to process a request. Concurrency allows services to work on multiple requests in parallel.

Without sessions, the server doesn't necessarily know that requests came from the same client, but it can rate limit based on IP address. In general, I don't think the service should expect all clients to be good actors; it should be configured defensively. Similarly, the client shouldn't make assumptions about what it takes to be a good actor.

The client should take hints from the service via the service manifest and react to errors with a back-off on retries.
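
A minimal sketch of that client behavior, assuming a Python client built on `requests` and the spec's form-encoded `queries` parameter (the helper name and retry policy are illustrative, not part of the spec):

```python
import json
import time

import requests  # hypothetical choice of HTTP client


def post_batch_with_backoff(endpoint, queries, max_retries=5):
    """Send one batch of queries, backing off when the service throttles us."""
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.post(endpoint, data={"queries": json.dumps(queries)})
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp.json()
        retry_after = resp.headers.get("Retry-After")
        try:
            # Prefer the service's hint; Retry-After can also be an HTTP date,
            # which this sketch does not handle.
            wait = float(retry_after)
        except (TypeError, ValueError):
            wait = delay
        time.sleep(wait)
        delay *= 2  # exponential back-off between retries
    raise RuntimeError(f"service still throttling after {max_retries} retries")
```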

gitonthescene commented 3 years ago

FWIW, my vision is that you could spin up scalable reconciliation services with some cloud service and keep them up as long as necessary. So for large jobs you'd want to take advantage of parallelism to speed up the job. This could/should be nearly transparent to the client. People could even take some pressure off large reconciliation servers by downloading a subset to reconcile against and spin up their own private service scaled as necessary. With modern cloud services this could be done with minimal cost.

fsteeg commented 3 years ago

We briefly touched on this issue in yesterday's meeting, discussing whether an HTTP status code like 429 (Too Many Requests) could be a reasonable way for services to communicate to clients that they should throttle their requests. This could work both as a fallback for exceptional situations (e.g. the service is out of resources) and as a way to dynamically negotiate batch sizes and concurrency (the client tries large/many requests and reduces until accepted by the service).
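
As a rough illustration of that negotiation, a client could start with a large batch and halve it whenever the service answers 429 (a hypothetical sketch; the names and the halving policy are assumptions, not part of the spec):

```python
import json

import requests  # hypothetical choice of HTTP client


def reconcile_all(endpoint, queries, start_batch_size=1000):
    """Probe the service's tolerance, halving the batch size on each 429."""
    batch_size = start_batch_size
    results = {}
    items = list(queries.items())  # queries keyed "q0", "q1", ...
    i = 0
    while i < len(items):
        batch = dict(items[i:i + batch_size])
        resp = requests.post(endpoint, data={"queries": json.dumps(batch)})
        if resp.status_code == 429 and batch_size > 1:
            batch_size //= 2  # the service pushed back: retry with smaller batches
            continue
        resp.raise_for_status()
        results.update(resp.json())  # responses are keyed by the same query ids
        i += len(batch)
    return results
```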

gitonthescene commented 3 years ago

That sounds reasonable to me. Services could add such handling by using a proxy, as mentioned in #40, and don't necessarily need to hand-implement it in the service itself.

It might be worth pointing out that with that nginx proxy you can choose the error code to return. I imagine there are other such proxies with similar configurability. This is just what I found from googling around.

gitonthescene commented 3 years ago

Tangentially related, I just learned of this tool for testing server load handling.

mixterj commented 3 years ago

Separating concurrency and rate limiting is important, but for concurrency, I think it will be hard, if not impossible, to address it in the API specification. Concurrency is an implementation detail of the servers being used to host the Recon API, so I do not have a very strong interest in seeing any reference to it in the spec. For rate limiting, though, I think the HTTP 429 response makes sense. We could suggest a format for the body of the 429 response if it is helpful.

gitonthescene commented 3 years ago

@mixterj - The suggestion at the start of this issue is to put the allowed max size of the batch in the manifest. This doesn’t directly address concurrency but if the service is only receiving ten queries at a time there’s likely not much to be gained in executing those ten queries in parallel. However, if the service announced that it could handle 1000 queries in a batch, that request could be split up and executed concurrently on the service side and joined before replying. The original motivation for this issue was to be able to scale a service horizontally.

This is orthogonal to how many of these batches the client fires off in a given time period. The service likely won’t distinguish getting lots of requests from a single client from getting lots of requests from multiple clients. In any event, this aspect is more about rate limiting than executing requests in parallel.

epaulson commented 2 years ago

I think there are at least 3 parameters that we should put into a 'serverLimits' section of the manifest (which is what it looks like we're settling on in #67).

For terminology, when I say 'submit a batch', I mean submitting an HTTP request to the service endpoint with a batch of queries. If a batch has 10 queries, if you submit 2 batches, you make 2 HTTP POST requests to the endpoint, for a total of 20 queries. Right now OpenRefine waits until the first batch is done before sending the second batch, but that's just OpenRefine's behavior.

Anyway, the three things I think we should put into the service manifest under the serverLimits section:

First, a 'Maximum Batch Size'. Right now OpenRefine is hard-coded to send up to 10 queries in a batch; it really should send more, but the server should be able to say the largest number of queries it will accept in a single batch. I think 1000 is a reasonable limit for most servers; there's probably not a good reason for a real service to set this lower, but certainly a service could make it bigger. We should put in the specs that if this parameter isn't present, the default is 10, to match existing OR behavior. If a client request exceeds the 'Maximum Batch Size', the server should respond with HTTP error code 413 Payload Too Large (https://developer.mozilla.org/en-US/docs/Web/HTTP/Status), though 413 is a bit obscure, and just responding with 400 with a more detailed message in the body seems fine too.
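
For illustration, a service could enforce such a limit along these lines (a Flask sketch with assumed names; the spec does not mandate any framework or exact error body):

```python
import json

from flask import Flask, jsonify, request

app = Flask(__name__)
MAX_BATCH_SIZE = 1000  # would be advertised in the manifest


@app.route("/reconcile", methods=["POST"])
def reconcile():
    queries = json.loads(request.form["queries"])
    if len(queries) > MAX_BATCH_SIZE:
        # 413 Payload Too Large; a plain 400 with a message in the body works too.
        return jsonify({"error": f"batch exceeds {MAX_BATCH_SIZE} queries"}), 413
    # ... perform the actual reconciliation; stub results shown here ...
    return jsonify({qid: {"result": []} for qid in queries})
```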

Second, there should be a 'Request Rate Limit', or maybe 'maximumRequestRate', which is the number of batch requests a client can make in a given time interval. I think a reasonable limit here might be 10 batch requests per second, but maybe it should be lower, like 1 per second.

(An aside: with a 10 batch requests/second rate limit, I'll bet it's unlikely but possible that OpenRefine could actually exceed that today, even with the current OR behavior of waiting until a batch completes before sending the next one; if the reconciliation service were very fast, OpenRefine might make at least 10 batch requests over the course of a second. However, given the overheads of making a connection, sending a batch request, processing the batch of queries on the service, sending the results back, and OpenRefine updating the project, I'm sure it's very unlikely that OpenRefine as it stands today could actually exceed 10 batch requests in a second.)

A request rate limit setting will matter more if OpenRefine is enhanced to issue requests in parallel, in which case it will need some kind of internal queuing so that when it generates batches of queries, it doesn't make all of the batch requests immediately, but instead doles them out at the limited rate.
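
A sketch of that kind of internal pacing, with illustrative names (the rate would come from the service's advertised limit):

```python
import time


class BatchDispatcher:
    """Dole out batch requests no faster than a configured rate,
    even when many batches are queued up and ready to send."""

    def __init__(self, max_requests_per_second=10):
        self.min_interval = 1.0 / max_requests_per_second
        self.last_sent = float("-inf")

    def wait_for_slot(self):
        # Sleep just long enough to keep the inter-request gap >= min_interval.
        wait = self.last_sent + self.min_interval - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self.last_sent = time.monotonic()


# Usage: call dispatcher.wait_for_slot() immediately before each batch POST.
```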

Request Rate Limit is usually measured on a per-client basis, e.g. the number of requests from a single IP address per second. If the client exceeds that threshold, the service should respond with HTTP response code 429 Too Many Requests. You can implement this in a couple of ways. Using something in your HTTP serving layer like NGINX is an option; see:

https://www.nginx.com/blog/rate-limiting-nginx/
https://docs.nginx.com/nginx/admin-guide/security-controls/controlling-access-proxied-http/
https://serverfault.com/questions/899122/nginx-request-limit-timeout-behavior-for-queries-in-the-queue-burst

Or you could do it at a slightly higher level, in the server itself: https://flask-limiter.readthedocs.io/en/stable/index.html
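
For instance, a per-client-IP limit with flask-limiter can be close to a one-liner (a sketch assuming flask-limiter 3.x; the rates are arbitrary examples):

```python
from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)

# Limits are keyed on the client's IP address; exceeding them yields a 429.
limiter = Limiter(get_remote_address, app=app, default_limits=["10 per second"])


@app.route("/reconcile", methods=["POST"])
@limiter.limit("10 per second;1000 per day")  # layered limits, as discussed below
def reconcile():
    return {"q0": {"result": []}}  # stub response
```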

There's also HTTP response code 503 Service Unavailable, which is really meant for when the service OVERALL is limited, and it's not the client's fault for sending more requests than it was allowed to send. Some non-recon APIs use 503 instead of 429. You could use NGINX to do both global and per-client-IP limiting, or use NGINX for global limiting and flask-limiter for per-client-IP limiting.

Both 503 and 429 response codes can include a 'Retry-After' header, which gives the client some idea of how long to wait before retrying.

To be safe, I think the API spec should say that if the maximumRequestRate value is missing from the manifest, the default should be assumed to be 1 request per second.

The API spec should be updated to include a note to implementers on how to handle 503 and 429 responses and Retry-After.

(Also note: I'll come back to this in a bit, but for the moment I've been ignoring service processing time; i.e. if OpenRefine issues a batch request with, say, 1000 queries, it will probably take the service a couple of seconds to complete that batch. Obviously, if we keep issuing 10 batch requests a second continuously, and each batch takes more than 1/10th of a second to complete, we'll build up an ever-growing queue of queries, which is bad.)

The third parameter that we should add is a ‘maximumQueriesInProgress’ setting. Note that this limit is for overall queries, not batches! So, if the maximumQueriesInProgress (MQIP from here on out) is set to 10,000 (which I think is a reasonable value?), and OpenRefine has already made 10 batch requests, each with 1000 queries in them, and none of them have completed yet, if OpenRefine makes an 11th batch request via an HTTP POST, the service should respond to that POST with HTTP Response 429.
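
One way a service could enforce that limit is with an in-progress counter, sketched below with assumed names (a real deployment would need to share the counter across worker processes):

```python
import json
import threading

from flask import Flask, jsonify, request

app = Flask(__name__)
MQIP = 10_000  # maximumQueriesInProgress, as advertised in the manifest
in_progress = 0
lock = threading.Lock()


@app.route("/reconcile", methods=["POST"])
def reconcile():
    global in_progress
    queries = json.loads(request.form["queries"])
    with lock:
        if in_progress + len(queries) > MQIP:
            return jsonify({"error": "too many queries in progress"}), 429
        in_progress += len(queries)
    try:
        results = {qid: {"result": []} for qid in queries}  # stub processing
    finally:
        with lock:
            in_progress -= len(queries)
    return jsonify(results)
```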

You could split that MQIP of 10,000 queries in different ways, so long as each batch stays under the maximumBatchSize setting and you don't issue batches faster than the request rate limit: perhaps 100 batches of 100 queries each, issued over the course of 10 seconds.

I don't think it would be easy to implement MQIP at the service using just NGINX configuration. I think it's possible to implement MQIP in a service using something like flask-limiter, though with a bit of code and not just a concise decorator like flask-limiter's examples.

The API spec should say that if the MQIP setting is missing from the service manifest, the default value should be 10. That’s way too small for a serious service, but it matches what OpenRefine does today, so we know services can handle that.

In #80, there's a 'maxConnections' setting, and if that were merged, you could think of MQIP = maxConnections * maximumBatchSize. (maxConnections is defined as 'The number of TCP connections which can be concurrently kept open from a single IP address'.) However, I think it's useful to capture MQIP as an explicit setting: some clients might want to use more connections with smaller batches in each request so that they complete faster, so we don't want to set the 'maxConnections' limit too small; but if we make 'maxConnections' too big while 'maximumBatchSize' is reasonably large, we can overwhelm the server. Creating 'MQIP' as a setting gets at the key threshold we want to manage.

I think it's reasonable to say 'forget maximumRequestRate' and instead just keep maxConnections, MQIP, and maximumBatchSize. That said, the nice thing about maximumRequestRate is that it stops us from getting hammered with lots of little requests, and (using something like flask-limiter) it allows more advanced configuration, like 'no more than 10 requests a second, but also no more than 1000 requests in a day'.

If we did keep ‘maxConnections’, the API specs should say that if the setting is missing from the service manifest, the default value should be 1, to match what OpenRefine does today.

At any rate, I think you need at least three settings, not just two: maximumBatchSize, maximumQueriesInProgress, and at least one of 'maxConnections' or 'maximumRequestRate'.

(One more thing that might be nice is to have 'recommended' versions of those settings alongside their 'maximum' settings. That's probably most useful for maximumBatchSize: a service might say "we recommend you pass us 100 queries in a batch, but we have a hard limit of 1000". Then again, I'm not sure there's really any good reason not to set the maximum and recommended values to the same thing.)

thadguidry commented 2 years ago

Thanks for your thoughts on this @epaulson. I've been thinking that some of this doesn't necessarily need to be in a manifest. It could be in an HTML document, as many web service providers do when they list daily or hourly quotas. The service's APIs would then be responsible for general responses such as HTTP "too many requests", "over limit", etc., rather than having any quota mechanisms built into the reconciliation-api manifests themselves.

So in the end, the question is where services really want to document their quotas (free tiers or paid): in HTML docs or in the manifests? I'd likely say that HTML docs will win out. But the downside is that an additional web service is then needed to serve documentation, rather than the recon service being somewhat self-documenting.

I think the idea that @wetneb stated in the first comment is to give machines (recon clients) ways to gather the quota or limit information themselves. That might be a neat way to handle this, but it might also be easier to expect that there is still a human on the other side who reads the service docs, finds the quota limits, and then configures the client as necessary to respect those limits.

I'm still on the fence about whether it's really necessary for recon services to be self-documenting when it comes to client quotas.

Maybe all that is needed is guidance in the reconciliation-api docs concerning parallel requests: check the provider's documentation for quotas, then configure clients accordingly.

gitonthescene commented 2 years ago

If the client limits the number of requests it sends or the size of the batches it sends, there is little the service can do to improve performance. If you are suggesting ramping up what the client sends and leaving it to services to deal with the increased volume, then perhaps servers can work out how to achieve the best performance. Having some negotiation beforehand via the manifest is a middle ground which seems reasonable to me. If nothing changes in the client, then services will always be at the mercy of the trickle of requests the client sends.

gitonthescene commented 2 years ago

Of course, people can set up alternative workflows which perform reconciliation through their own clients, side-stepping the limitations in OpenRefine, but somehow encouraging that doesn't seem to be in anyone's interest.

Because it’s possible, though, it still behooves server authors to be defensive against excessive volume regardless of any negotiation via the manifest.

fsteeg commented 2 years ago

Trying to summarize our discussion of this in our May meeting: to move forward, we could add the two items that we agree about (batchSize and 429 responses) and see what else we need in the future.

Since we'd be adding a single value, it should probably not go into a separate sub-object in the manifest, but directly on the root level (as currently proposed in https://github.com/reconciliation-api/specs/pull/80).
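
For illustration, a manifest carrying the agreed value might then look like this (shown as a Python dict; the `batchSize` name follows the proposal in #80, and the other fields and values are just examples):

```python
# Hypothetical manifest fragment with `batchSize` at the root level, per #80.
manifest = {
    "versions": ["0.2"],
    "name": "Example Reconciliation Service",
    "identifierSpace": "https://example.com/identifiers",
    "schemaSpace": "https://example.com/schema",
    "batchSize": 50,  # maximum number of queries the service accepts per batch
}
```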

wetneb commented 2 years ago

Thanks for the summary, I have updated #80 to reflect that.

fsteeg commented 3 weeks ago

PR #80 is merged, closing.