Open · problame opened 9 months ago
Idea 3: Store the logical size persistently as a separate key-value pair in the storage.
Whenever a relation is extended or truncated, update the logical size key-value pair too, in WAL ingestion.
That makes it fast to access the logical size at any point in time, with no special caching required. The downside is that it adds work to the WAL ingestion codepath instead. It's unclear how significant that is, but given how much trouble the logical size calculations are causing us, it might be the right tradeoff.
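A minimal sketch of Idea 3, assuming a toy key-value store (`KvStore`, `LOGICAL_SIZE_KEY`, and `apply_rel_size_change` are illustrative names, not the real pageserver API): the logical size lives under its own key and is updated inline whenever WAL ingestion extends or truncates a relation, so reading it is a plain lookup.

```rust
use std::collections::HashMap;

// Illustrative key name; the real keyspace encoding would differ.
const LOGICAL_SIZE_KEY: &str = "logical_size";
const BLOCK_SIZE: i64 = 8192; // Postgres block size in bytes

/// Toy stand-in for the pageserver's keyspace.
struct KvStore {
    map: HashMap<String, i64>,
}

impl KvStore {
    fn new() -> Self {
        Self { map: HashMap::new() }
    }

    /// Apply a relation size change (in blocks) during WAL ingestion.
    /// An extension passes a positive delta, a truncation a negative
    /// one; the logical size key is updated in the same write path,
    /// so it is always current at the ingested LSN.
    fn apply_rel_size_change(&mut self, delta_blocks: i64) {
        let entry = self.map.entry(LOGICAL_SIZE_KEY.to_string()).or_insert(0);
        *entry += delta_blocks * BLOCK_SIZE;
    }

    /// Reading the logical size is now a plain key lookup: no lazy
    /// recalculation, no walk over all relations.
    fn logical_size(&self) -> i64 {
        *self.map.get(LOGICAL_SIZE_KEY).unwrap_or(&0)
    }
}
```

The cost is an extra key write per relation extension/truncation during ingest, which is the tradeoff the idea weighs against the expensive lazy calculation.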
Meeting notes today:
We anticipate persisting snapshots of timeline logical sizes to remote storage in the near future to enable hibernated timelines (#8088), which should also enable us to ensure that we always have a logical size for a timeline. This may lag ingest a little bit after restart, but it will eliminate the 0 logical size phase.
Problem
Logical size is part of `PageserverFeedback`, which is sent from PS to SK so that SK can enforce the project's logical size limit: https://github.com/neondatabase/neon/blob/d8c21ec70d60f5e4a4675a16bc596cbf60eefc8f/pageserver/src/tenant/timeline/walreceiver/walreceiver_connection.rs#L398-L404
Logical size is calculated lazily. Before the lazy calculation completes, the value returned is only the logical size delta since PS startup; if that delta is negative, we currently round it up to 0.
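The behavior described above can be modeled in a few lines (field and type names here are illustrative, not the actual pageserver types): until the lazily computed startup base is available, only the clamped delta is reported.

```rust
/// Minimal model of the current reporting behavior.
struct LogicalSize {
    /// Lazily-computed logical size at startup; None until the
    /// calculation finishes.
    base: Option<u64>,
    /// Size delta accumulated by WAL ingestion since startup.
    delta_since_start: i64,
}

impl LogicalSize {
    fn reported(&self) -> u64 {
        match self.base {
            // Accurate: startup base plus the delta since then.
            Some(base) => (base as i64 + self.delta_since_start).max(0) as u64,
            // Before the lazy calculation finishes: just the delta,
            // rounded up to 0 if negative -- this is the value that
            // can be far below the actual size.
            None => self.delta_since_start.max(0) as u64,
        }
    }
}
```

This makes the worst case visible: a large database that only shrinks slightly after restart reports a size near 0 until the base calculation completes.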
The (quite common) worst case: whenever we restart the PS, there's a window in which we report a logical size that is far below the actual logical size, likely near 0. This allows a project to go over its logical size limit. Once we're done calculating, we report the correct value, but at that point the user may already be over the size limit, i.e., using more logical size than they're allowed (and paying for?).
Fixing This
We should not start walreceiver connections to SKs until we have an accurate logical size.
The challenge is that the logical size needs to be available quickly because walreceiver connection establishment is on the user-visible path, i.e., it's a latency-bound task.
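One way to picture the constraint above (a sketch only; `wait_for_logical_size` and its deadline are hypothetical, not an existing function): gate walreceiver connection establishment on the logical size being available, but bound the wait, since connection establishment is on the user-visible path.

```rust
use std::time::{Duration, Instant};

/// Hypothetical gate: wait (with a deadline) until an accurate
/// logical size is available before starting the walreceiver
/// connection. `get_size` stands in for the lazily-computed value.
fn wait_for_logical_size(
    mut get_size: impl FnMut() -> Option<u64>,
    deadline: Duration,
) -> Result<u64, &'static str> {
    let start = Instant::now();
    loop {
        if let Some(size) = get_size() {
            return Ok(size);
        }
        if start.elapsed() >= deadline {
            // Connection establishment is latency-bound, so we
            // can't block indefinitely on the lazy calculation.
            return Err("timed out waiting for logical size");
        }
        std::thread::sleep(Duration::from_millis(10));
    }
}
```

The design ideas below are different ways of making `get_size` return quickly at startup so this gate almost never hits its deadline.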
Design Idea 1
Persistently cache the incremental logical size on disk and re-use it during startup.
Implement probabilistic invalidation of the cache
Implement probabilistic re-calculation of the base logical size, because we don't fully trust the incremental logical size calculation (do we really not?)
Eventually: trust-but-verify the incremental logical size calculation, i.e., trust it but have something (a local probabilistic checker, control plane, whatever) trigger checks that would log errors & correct it.
(As a follow-up, also think about how this change impacts synthetic logical size calculations)
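The trust-but-verify step above could look roughly like this (a sketch under assumptions: the probability, RNG source, and function names are all illustrative): usually trust the cached incremental size, but with small probability run the expensive base calculation, log any drift, and adopt the recomputed value.

```rust
/// Trust-but-verify sketch: take the cached incremental size, and
/// with probability `p_verify` also run the expensive base
/// calculation; if they disagree, log the drift and adopt the
/// recomputed value.
fn resolve_logical_size(
    cached: u64,
    roll: f64,                   // uniform random number in [0, 1)
    p_verify: f64,               // chance of running the check
    recompute: impl Fn() -> u64, // expensive base calculation
) -> u64 {
    if roll < p_verify {
        let actual = recompute();
        if actual != cached {
            // In a real implementation this would be a rate-limited
            // error log plus a metric, so drift gets noticed.
            eprintln!("logical size cache drift: cached={cached} actual={actual}");
        }
        actual
    } else {
        cached
    }
}
```

Keeping `p_verify` small bounds the extra startup cost while still surfacing (and self-correcting) any systematic error in the incremental calculation.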
Design Idea 2