Open · problame opened 9 months ago
Idea 3: Store the logical size persistently as a separate key-value pair in the storage.
Whenever a relation is extended or truncated, update the logical size key-value pair too, in WAL ingestion.
That makes it fast to access the logical size at any point in time, with no special caching required. The downside is that it adds work to the WAL ingestion codepath instead. It's unclear how significant that is, but given how much trouble the logical size calculations are causing us, it might be the right tradeoff.
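A minimal sketch of Idea 3, assuming a toy key-value store (`KvStore`, `LOGICAL_SIZE_KEY`, and `apply_rel_size_change` are illustrative names, not the real pageserver API): the logical size lives under its own key and is updated inline whenever WAL ingestion extends or truncates a relation, so reading it is a plain lookup.

```rust
use std::collections::HashMap;

// Illustrative key name; the real keyspace encoding would differ.
const LOGICAL_SIZE_KEY: &str = "logical_size";
const BLOCK_SIZE: i64 = 8192; // Postgres block size in bytes

/// Toy stand-in for the pageserver's keyspace.
struct KvStore {
    map: HashMap<String, i64>,
}

impl KvStore {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Apply a relation size change (in blocks) during WAL ingestion.
    /// An extension passes a positive delta, a truncation a negative
    /// one; the logical size key is updated in the same write path,
    /// so it is always current at the ingested LSN.
    fn apply_rel_size_change(&mut self, delta_blocks: i64) {
        let entry = self.map.entry(LOGICAL_SIZE_KEY.to_string()).or_insert(0);
        *entry += delta_blocks * BLOCK_SIZE;
    }

    /// Reading the logical size is now a plain key lookup: no lazy
    /// recalculation, no walk over all relations.
    fn logical_size(&self) -> i64 {
        *self.map.get(LOGICAL_SIZE_KEY).unwrap_or(&0)
    }
}
```

The cost is an extra key write per relation extension/truncation during ingest, which is the tradeoff the idea weighs against the expensive lazy calculation.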
Meeting notes today:
We anticipate persisting snapshots of timeline logical sizes to remote storage in the near future to enable hibernated timelines (#8088), which should also enable us to ensure that we always have a logical size for a timeline. This may lag ingest a little bit after restart, but it will eliminate the 0 logical size phase.
Problem
Logical size is part of `PageserverFeedback`, which is sent from PS to SK so that SK can enforce the project's logical size limit: https://github.com/neondatabase/neon/blob/d8c21ec70d60f5e4a4675a16bc596cbf60eefc8f/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs#L398-L404
Logical size is calculated lazily. Before the lazy calculation completes, the value returned is only the logical size delta since PS startup; if that delta is negative, we currently round it up to 0.
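The behavior described above can be modeled in a few lines (field and type names here are illustrative, not the actual pageserver types): until the lazily computed startup base is available, only the clamped delta is reported.

```rust
/// Minimal model of the current reporting behavior.
struct LogicalSize {
    /// Lazily-computed logical size at startup; None until the
    /// calculation finishes.
    base: Option<u64>,
    /// Size delta accumulated by WAL ingestion since startup.
    delta_since_start: i64,
}

impl LogicalSize {
    fn reported(&self) -> u64 {
        match self.base {
            // Accurate: startup base plus the delta since then.
            Some(base) => (base as i64 + self.delta_since_start).max(0) as u64,
            // Before the lazy calculation finishes: just the delta,
            // rounded up to 0 if negative -- this is the value that
            // can be far below the actual size.
            None => self.delta_since_start.max(0) as u64,
        }
    }
}
```

This makes the worst case visible: a large database that only shrinks slightly after restart reports a size near 0 until the base calculation completes.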
The (quite common) worst case: whenever we restart the PS, there's a window in which we report a logical size that is far below the actual logical size, likely near 0. This allows a project to go over its logical size limit. Once we're done calculating, we report the correct value, but at that point the user may already be over the size limit, i.e., using more logical size than they're allowed (and paying for?).
Fixing This
We should not start walreceiver connections to SKs until we have an accurate logical size.
The challenge is that the logical size needs to be available quickly because walreceiver connection establishment is on the user-visible path, i.e., it's a latency-bound task.
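One way to picture the constraint above (a sketch only; `wait_for_logical_size` and its deadline are hypothetical, not an existing function): gate walreceiver connection establishment on the logical size being available, but bound the wait, since connection establishment is on the user-visible path.

```rust
use std::time::{Duration, Instant};

/// Hypothetical gate: wait (with a deadline) until an accurate
/// logical size is available before starting the walreceiver
/// connection. `get_size` stands in for the lazily-computed value.
fn wait_for_logical_size(
    mut get_size: impl FnMut() -> Option<u64>,
    deadline: Duration,
) -> Result<u64, &'static str> {
    let start = Instant::now();
    loop {
        if let Some(size) = get_size() {
            return Ok(size);
        }
        if start.elapsed() >= deadline {
            // Connection establishment is latency-bound, so we
            // can't block indefinitely on the lazy calculation.
            return Err("timed out waiting for logical size");
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```

The design ideas below are different ways of making `get_size` return quickly at startup so this gate almost never hits its deadline.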
Design Idea 1
Persistently cache the incremental logical size on disk and re-use it during startup.
Implement probabilistic invalidation of the cache
Implement probabilistic re-calculation of the base logical size, because we don't fully trust the incremental logical size calculation (do we really not?)
Eventually: trust-but-verify the incremental logical size calculation, i.e., trust it but have something (a local probabilistic checker, control plane, whatever) trigger checks that would log errors & correct it.
(As a follow-up, also think about how this change impacts synthetic logical size calculations)
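The trust-but-verify step above could look roughly like this (a sketch under assumptions: the probability, RNG source, and function names are all illustrative): usually trust the cached incremental size, but with small probability run the expensive base calculation, log any drift, and adopt the recomputed value.

```rust
/// Trust-but-verify sketch: take the cached incremental size, and
/// with probability `p_verify` also run the expensive base
/// calculation; if they disagree, log the drift and adopt the
/// recomputed value.
fn resolve_logical_size(
    cached: u64,
    roll: f64,                   // uniform random number in [0, 1)
    p_verify: f64,               // chance of running the check
    recompute: impl Fn() -> u64, // expensive base calculation
) -> u64 {
    if roll < p_verify {
        let actual = recompute();
        if actual != cached {
            // In a real implementation this would be a rate-limited
            // error log plus a metric, so drift gets noticed.
            eprintln!("logical size cache drift: cached={cached} actual={actual}");
        }
        actual
    } else {
        cached
    }
}
```

Keeping `p_verify` small bounds the extra startup cost while still surfacing (and self-correcting) any systematic error in the incremental calculation.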
Design Idea 2