Pageserver is allegedly takes a lot of time to restart when there are a lot of tenants

kelvich commented 1 year ago

It's more of an observation, so should be verified first. Staging has pageservers with 40k+ tenants

kelvich commented 1 year ago

cc @hlinnaka

koivunej commented 1 year ago

With 40k+ tenants we probably do not get metrics anymore?

This is most likely related to #4025.

LizardWizzard commented 1 year ago

Discussion happens in this long thread: https://neondb.slack.com/archives/C033RQ5SPDH/p1685012031795059

koivunej commented 1 year ago

I posted earlier attempts (#4366, revert) on #4366. After #4372 it looks a bit more promising without too intrusive changes:

after deploying #4366 on staging:

ps-0.eu-west-1 (10k): 100s => 37s, 6s
ps-1.eu-west-1 (8k): 73s => 5s, 5.5s
ps-99.us-east-2 (<2k?): 2.8s => 2.3s, 2s

so I think this looks at least not bad.

But I haven't been able to retry these results yet. I suspect that the remaining problem is the blocking of the background runtime for initial logical size AND repartitioning. The "page_service connection pressure" has been brought up as an idea to lower the activation time for timelines which are being being re-connected to.

Designing and implementing such prioritization system might not be straightforward. Basically it would have to act as a semaphore, but upon getting a notification of page_service connection, it should allow these instaces to jump the queue. But what would this prioritization protect? The first initial logical size calculation's?

Perhaps an easier step is to delay initial repartition + compaction and garbage collection until we've attempted all initial logical size calculations. This should probably delay the timeline's eviction task as well just to be sure. Unsure if this is the right path, because we might end up in a situation that some timelines do not get an active walreceiver connection, and so they would not get an initial logical size calculation happening.

koivunej commented 1 year ago

With #4397 staging startup times:

ps-0.eu-west-1 (8k): 4.6s, 4.0s
ps-1.eu-west-1 (8k): 3.4s, 3.5s
ps-99.us-east-2 (<2k?): 2.1s, 2.3s

Not really comparable anymore, because ps-0 lost 2k tenants. However, the high values are no longer expected.

The #4399 would further help this by delaying all initial logical size calculations to a phase which runs after we've completed activating all tenants. There will be no background jobs running until timeout (10s by default). It is assumed that the 10s would be spent efficiently doing many queued up initial logical size calculations before letting the compactions start.

koivunej commented 10 months ago

I'll just close this because after changes we have now different causes. Originally this helped.

neondatabase / neon

Pageserver is allegedly takes a lot of time to restart when there are a lot of tenants #4183