
remote_storage: revisit throttling/ratelimiting #3698

Open koivunej opened 1 year ago

koivunej commented 1 year ago

#3663 made the semaphore be held until the download completed; this sparked further discussion on whether the single semaphore (100 permits) is a good limit and how we should limit it.
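
(For context, a minimal sketch of what "held until download completed" looks like, assuming a plain `tokio::sync::Semaphore`; the function shape and names are invented for illustration, not the actual pageserver code.)

```rust
use std::{future::Future, sync::Arc};

use tokio::sync::Semaphore;

/// Sketch only: the permit is taken before the download starts and is
/// dropped only after the download future finishes, so at most the
/// semaphore's permit count (e.g. 100) operations run concurrently.
async fn limited_download<F: Future<Output = ()>>(limit: Arc<Semaphore>, download: F) {
    let _permit = limit
        .acquire_owned()
        .await
        .expect("semaphore is never closed");
    download.await;
    // `_permit` drops here and releases the slot.
}
```

With this shape a slow download holds its slot for its whole duration, which is what raised the question of whether 100 permits is the right limit.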

Original thread: https://neondb.slack.com/archives/C04C55G1RHB/p1677153938527409

Alternatives proposed:

koivunej commented 1 year ago

Earlier I made a version of a leaky bucket that required an additional task, or something similar to that. After auditing https://github.com/udoprog/leaky-bucket it seems like a good implementation, and it requires neither a separate task nor a channel for sending the permits. I'll PR this in.

Follow-up after the discussions about the elusive S3 prefix: I'm thinking that we might need this rate limiter per tenant_id, assuming we don't already max out the bandwidth with rps=3500.
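
(For illustration, a minimal sketch of how the crate could be wired up per `tenant_id`, assuming the `leaky-bucket` 1.x builder API; the permit numbers, the `TenantId` alias, and the map of limiters are invented here, not what the PR does.)

```rust
use std::{collections::HashMap, sync::Arc, time::Duration};

use leaky_bucket::RateLimiter;
use tokio::sync::Mutex;

/// Placeholder; the real pageserver has its own TenantId type.
type TenantId = String;

/// One request-rate limiter per tenant; all numbers are made up.
#[derive(Default)]
struct PerTenantLimits {
    limiters: Mutex<HashMap<TenantId, Arc<RateLimiter>>>,
}

impl PerTenantLimits {
    async fn limiter_for(&self, tenant: &TenantId) -> Arc<RateLimiter> {
        let mut map = self.limiters.lock().await;
        map.entry(tenant.clone())
            .or_insert_with(|| {
                Arc::new(
                    RateLimiter::builder()
                        // Refill 100 permits per second; no background task or
                        // permit channel is needed, the waiters drive the refill.
                        .max(100)
                        .initial(100)
                        .refill(100)
                        .interval(Duration::from_secs(1))
                        .build(),
                )
            })
            .clone()
    }

    async fn throttled_download(&self, tenant: &TenantId) {
        // One permit per request start; the permit is not held for the
        // duration of the download, unlike the semaphore approach.
        self.limiter_for(tenant).await.acquire_one().await;
        // ... issue the actual S3 GET here ...
    }
}
```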

hlinnaka commented 1 year ago

Let's step back a bit. Why do we need to rate limit the S3 requests in the first place? From the slack thread:

> assuming we have finite bandwidth and AWS infrastructure is able to give us full-bandwidth S3 downloads at 100 requests, then I don't see any sense in doing more than 100 concurrent downloads, as the whole will not become faster, there would just be much more scheduling overhead. but this would also require S3 to have negligible latency of course.. the underlying HTTP client should be reusing the TCP connections anyway as soon as the semaphore lets them.

Yes, ok, that makes some sense. You don't want to launch any more parallel downloads, if the server's network interface is already saturated.

I remember that we also had problems with a large number of IAM requests; IAM has fairly low rate limits. But that was solved: we don't issue a separate IAM request for every GET request anymore, so that should no longer be a problem.

So let's be very clear about what we are trying to accomplish: We are trying to avoid fully saturating the server's network interface. Right?

For that, limiting the # of requests started per second doesn't make much sense. A small number of very large GET requests are much more likely to saturate the network bandwidth than a large number of small GET requests. We should measure network bandwidth more directly, not requests. We care about the total # of bytes / s.
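
(A rough, hypothetical sketch of throttling on bytes instead of request starts, reusing the same leaky-bucket idea where one permit stands for 1 KiB of budget; the numbers and names are invented, this is not a concrete proposal from the thread.)

```rust
use std::{sync::Arc, time::Duration};

use leaky_bucket::RateLimiter;

/// One permit ~= 1 KiB of download budget, refilled in 100 ms slices,
/// so `kib_per_sec` caps the aggregate download rate. Numbers invented.
fn bandwidth_limiter(kib_per_sec: usize) -> Arc<RateLimiter> {
    Arc::new(
        RateLimiter::builder()
            .max(kib_per_sec) // at most ~1 s worth of burst
            .refill(kib_per_sec / 10)
            .interval(Duration::from_millis(100))
            .build(),
    )
}

async fn throttled_get(limiter: &RateLimiter, bucket_kib: usize, object_size_bytes: usize) {
    // Charge the whole object up front, capped at the bucket size so a
    // very large object cannot wait for more permits than can ever exist.
    let kib = (object_size_bytes / 1024).clamp(1, bucket_kib);
    limiter.acquire(kib).await;
    // ... perform the S3 GET here; a finer-grained variant would charge
    // per received chunk, tracking actual bytes/s rather than estimates.
}
```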

The semaphore approach was not great. It basically assumed that each operation can do x MB/s, and then used the # of concurrent operations as a proxy for the network bandwidth used. Nevertheless, it seems more sensible than limiting the # of requests started per second. By rate limiting the # of requests started, if each operation takes e.g. 5 seconds and the rate limit is 100 / s, you can have 500 concurrent operations in progress, which is more than we want. On the other hand, if each operation only takes 0.1 s, you are seriously throttling how many of those requests you allow, for no good reason AFAICS.

(There is also a limit in AWS S3 of 5500 GET requests / s per prefix (see https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html). We use a different prefix for each timeline, so if I understand correctly, we would need to do more than 5500 GET requests per second against a particular timeline to hit that limit. I don't think we need to worry about reaching it. And even if we do hit it, I think we can just let AWS do the throttling for us. We don't gain anything by throttling ourselves earlier.)

koivunej commented 1 year ago

I was hoping this would allow the initial index_part.json requests to complete faster, for faster activation, but it seems the error rate has increased. An example log search `{neon_service="pageserver"} |~ "Failed to (download|upload)" !~ "No space left on device"` finds at least:

I think there has to be some feedback loop from these errors to opening connections; they appeared on servers using concurrency_limit=100, which I re-used as the requests per second limit. Of course, any blocking of our async worker threads may have stalled the downloads and thus caused more errors.

I'd say it was still worth testing, if it could have helped us avoid the more complex solution which, as I understand your speculation above, might be a hybrid of X in-flight requests with bandwidth consumption?
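
(To make that speculation concrete, a hedged sketch of such a hybrid: a semaphore caps in-flight operations while a byte bucket caps bandwidth. All names and numbers here are hypothetical, not the shape of any existing PR.)

```rust
use std::time::Duration;

use leaky_bucket::RateLimiter;
use tokio::sync::Semaphore;

/// Hypothetical hybrid limit: at most `max_inflight` concurrent downloads,
/// which together stay under roughly `kib_per_sec` of download budget.
struct HybridLimit {
    inflight: Semaphore,
    bandwidth: RateLimiter,
    bucket_kib: usize,
}

impl HybridLimit {
    fn new(max_inflight: usize, kib_per_sec: usize) -> Self {
        Self {
            inflight: Semaphore::new(max_inflight),
            bandwidth: RateLimiter::builder()
                .max(kib_per_sec)
                .refill(kib_per_sec / 10)
                .interval(Duration::from_millis(100))
                .build(),
            bucket_kib: kib_per_sec,
        }
    }

    async fn download(&self, object_size_bytes: usize) {
        // Hold the concurrency permit for the whole download (as #3663 did),
        // and additionally charge the byte budget before starting.
        let _permit = self
            .inflight
            .acquire()
            .await
            .expect("semaphore is never closed");
        let kib = (object_size_bytes / 1024).clamp(1, self.bucket_kib);
        self.bandwidth.acquire(kib).await;
        // ... perform the S3 GET while `_permit` is held ...
    }
}
```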

I'll revert PR #4292 next. I still think that we are not in a position to raise the semaphore limit, assuming there is a requests-per-second limit on each prefix. Running into issues while testing #4292 makes me think that the rate limiting does not happen on just the prefixes, because otherwise we should not have hit it at all with 10k single-timeline tenants, which should be 10k prefixes.

shanyp commented 1 year ago

On startup, shouldn't we first download all index_part.json files and only then continue with the rest of the downloads, or, as I suggested, just give these higher priority?

koivunej commented 1 year ago

Well, that's how it should work right now, but every tenant does its own:

```
for each timeline
    download index_part.json
    reconcile
    schedule uploads
```

(in background per timeline: process upload queue)

I don't think we do any downloads right now during the initial load, because initial logical size calculations are idle until all initial loads complete (since #4399). We could certainly do uploads, though. We should probably add metrics for the initial load time and then think about how to measure the wait times.

I think we are now, however, at around 1 ms/tenant. I think that might be quite good for an eager approach.