
remote_storage: revisit throttling/ratelimiting #3698

Open koivunej opened 1 year ago

koivunej commented 1 year ago

#3663 made the semaphore be held until the download completed; this sparked further discussion on whether the single semaphore (100 permits) is a good limit and how we should limit it.
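
(For context, a minimal sketch of what "held until download completed" looks like, assuming a plain `tokio::sync::Semaphore`; the function shape and names are invented for illustration, not the actual pageserver code.)

```rust
use std::{future::Future, sync::Arc};

use tokio::sync::Semaphore;

/// Sketch only: the permit is taken before the download starts and is
/// dropped only after the download future finishes, so at most the
/// semaphore's permit count (e.g. 100) operations run concurrently.
async fn limited_download<F: Future<Output = ()>>(limit: Arc<Semaphore>, download: F) {
    let _permit = limit
        .acquire_owned()
        .await
        .expect("semaphore is never closed");
    download.await;
    // `_permit` drops here and releases the slot.
}
```

With this shape a slow download holds its slot for its whole duration, which is what raised the question of whether 100 permits is the right limit.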

Original thread: https://neondb.slack.com/archives/C04C55G1RHB/p1677153938527409

Alternatives proposed:

koivunej commented 1 year ago

Earlier I made a version of a leaky bucket that required an additional task, or something similar to that. After auditing https://github.com/udoprog/leaky-bucket it seems like a good implementation, and it requires neither a separate task nor a channel for sending the permits. I'll PR this in.

Follow-up after the discussions about the elusive S3 prefix: I'm thinking that we might need this rate limiter per tenant_id, assuming we don't already max out the bandwidth with rps=3500.
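
(For illustration, a minimal sketch of how the crate could be wired up per `tenant_id`, assuming the `leaky-bucket` 1.x builder API; the permit numbers, the `TenantId` alias, and the map of limiters are invented here, not what the PR does.)

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};

use leaky_bucket::RateLimiter;
use tokio::sync::Mutex;

/// Placeholder; the real pageserver has its own TenantId type.
type TenantId = String;

/// One request-rate limiter per tenant; all numbers are made up.
#[derive(Default)]
struct PerTenantLimits {
    limiters: Mutex<HashMap<TenantId, Arc<RateLimiter>>>,
}

impl PerTenantLimits {
    async fn limiter_for(&self, tenant: &TenantId) -> Arc<RateLimiter> {
        let mut map = self.limiters.lock().await;
        map.entry(tenant.clone())
            .or_insert_with(|| {
                Arc::new(
                    RateLimiter::builder()
                        // Refill 100 permits per second; no background task or
                        // permit channel is needed, the waiters drive the refill.
                        .max(100)
                        .initial(100)
                        .refill(100)
                        .interval(Duration::from_secs(1))
                        .build(),
                )
            })
            .clone()
    }

    async fn throttled_download(&self, tenant: &TenantId) {
        // One permit per request start; the permit is not held for the
        // duration of the download, unlike the semaphore approach.
        self.limiter_for(tenant).await.acquire_one().await;
        // ... issue the actual S3 GET here ...
    }
}
```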

hlinnaka commented 1 year ago

Let's step back a bit. Why do we need to rate limit the S3 requests in the first place? From the slack thread:

> assuming we have finite bandwidth and AWS infrastructure is able to give us full-bandwidth S3 downloads at 100 requests, then I don't see any sense in doing more than 100 concurrent downloads, as the whole will not become faster, there would just be much more scheduling overhead. but this would also require S3 to have negligible latency of course.. the underlying HTTP client should be reusing the TCP connections anyway as soon as the semaphore lets them.

Yes, ok, that makes some sense. You don't want to launch any more parallel downloads, if the server's network interface is already saturated.

I remember that we also had problems with a large number of IAM requests; IAM has fairly low rate limits. But that was solved: we don't issue a separate IAM request for every GET request anymore, so that should no longer be a problem.

So let's be very clear about what we are trying to accomplish: We are trying to avoid fully saturating the server's network interface. Right?

For that, limiting the # of requests started per second doesn't make much sense. A small number of very large GET requests are much more likely to saturate the network bandwidth than a large number of small GET requests. We should measure network bandwidth more directly, not requests. We care about the total # of bytes / s.
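
(A rough, hypothetical sketch of throttling on bytes instead of request starts, reusing the same leaky-bucket idea where one permit stands for 1 KiB of budget; the numbers and names are invented, this is not a concrete proposal from the thread.)

```rust
use std::{sync::Arc, time::Duration};

use leaky_bucket::RateLimiter;

/// One permit ~= 1 KiB of download budget, refilled in 100 ms slices,
/// so `kib_per_sec` caps the aggregate download rate. Numbers invented.
fn bandwidth_limiter(kib_per_sec: usize) -> Arc<RateLimiter> {
    Arc::new(
        RateLimiter::builder()
            .max(kib_per_sec) // at most ~1 s worth of burst
            .refill(kib_per_sec / 10)
            .interval(Duration::from_millis(100))
            .build(),
    )
}

async fn throttled_get(limiter: &RateLimiter, bucket_kib: usize, object_size_bytes: usize) {
    // Charge the whole object up front, capped at the bucket size so a
    // very large object cannot wait for more permits than can ever exist.
    let kib = (object_size_bytes / 1024).clamp(1, bucket_kib);
    limiter.acquire(kib).await;
    // ... perform the S3 GET here; a finer-grained variant would charge
    // per received chunk, tracking actual bytes/s rather than estimates.
}
```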

The semaphore approach was not great. It basically assumed that each operation can do x MB/s, and then used the # of concurrent operations as a proxy for the network bandwidth used. Nevertheless, it seems more sensible than limiting the # of requests started per second. By rate limiting the # of requests started, if each operation takes e.g. 5 seconds and the rate limit is 100 / s, you can have 500 concurrent operations in progress, which is more than we want. On the other hand, if each operation only takes 0.1 s, you are seriously throttling how many of those requests you allow, for no good reason AFAICS.

(There is also a limit in AWS S3 of 5500 GET requests / s per prefix (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html). We use a different prefix for each timeline, so if I understand correctly, we would need to do more than 5500 GET requests per second against a particular timeline to hit that limit. I don't think we need to worry about reaching it. And even if we do hit it, I think we can just let AWS do the throttling for us. We don't gain anything by throttling ourselves earlier.)

koivunej commented 1 year ago

I was hoping this would allow the initial index_part.json requests to complete faster, for faster activation, but it seems the error rate has increased. An example log search `{neon_service="pageserver"} |~ "Failed to (download|upload)" !~ "No space left on device"` finds at least:

I think there has to be some feedback loop from these errors to opening connections; they appeared on servers using concurrency_limit=100, which I re-used as the requests per second limit. Of course, any blocking of our async worker threads may have stalled the downloads and thus caused more errors.

I'd say it was still worth testing, if it could have helped us avoid the more complex solution which, as I understand your speculation above, might be a hybrid of X in-flight requests with bandwidth consumption?
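
(To make that speculation concrete, a hedged sketch of such a hybrid: a semaphore caps in-flight operations while a byte bucket caps bandwidth. All names and numbers here are hypothetical, not the shape of any existing PR.)

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;
use tokio::sync::Semaphore;

/// Hypothetical hybrid limit: at most `max_inflight` concurrent downloads,
/// which together stay under roughly `kib_per_sec` of download budget.
struct HybridLimit {
    inflight: Semaphore,
    bandwidth: RateLimiter,
    bucket_kib: usize,
}

impl HybridLimit {
    fn new(max_inflight: usize, kib_per_sec: usize) -> Self {
        Self {
            inflight: Semaphore::new(max_inflight),
            bandwidth: RateLimiter::builder()
                .max(kib_per_sec)
                .refill(kib_per_sec / 10)
                .interval(Duration::from_millis(100))
                .build(),
            bucket_kib: kib_per_sec,
        }
    }

    async fn download(&self, object_size_bytes: usize) {
        // Hold the concurrency permit for the whole download (as #3663 did),
        // and additionally charge the byte budget before starting.
        let _permit = self
            .inflight
            .acquire()
            .await
            .expect("semaphore is never closed");
        let kib = (object_size_bytes / 1024).clamp(1, self.bucket_kib);
        self.bandwidth.acquire(kib).await;
        // ... perform the S3 GET while `_permit` is held ...
    }
}
```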

I'll revert PR #4292 next. I still think that we are not in a position to raise the semaphore limit, assuming there is a requests-per-second limit on each prefix. Running into issues while testing #4292 makes me think that the rate limiting does not happen on just the prefixes, because otherwise we should not have hit it at all with 10k single-timeline tenants, which should be 10k prefixes.

shanyp commented 1 year ago

On startup, shouldn't we first download all index_part.json files and only then continue with the rest of the downloads, or, as I suggested, just give these higher priority?

koivunej commented 1 year ago

Well, that's how it should work right now, but every tenant does its own:

```
for each timeline
    download index_part.json
    reconcile
    schedule uploads
```

(in background per timeline: process upload queue)

I don't think we do any downloads right now during the initial load, because initial logical size calculations are idle until all initial loads complete (since #4399). We could certainly do uploads, though. We should probably add metrics for the initial load time and then think about how to measure the wait times.

I think we are now, however, at around 1 ms/tenant. I think that might be quite good for an eager approach.