
Epic: Improve LastWrittenLsn mechanism #2760

Open MMeent opened 1 year ago

MMeent commented 1 year ago

Motivation

Currently, we have a limit of N=128 block ranges (or 1028 blocks after #2493) with LRU eviction. This is OK, but we could evict many pages at once, resulting in a relatively high overall evicted LSN.

Because a high LastWrittenLSN leaves only a short time for the WAL to propagate to the PageServer (so the request may have to wait for it), it is better to have the lowest possible (within reason) LastWrittenLSN, i.e., the lowest reasonably possible LSN in the request. Even for pages that haven't been updated in a long time and where WAL propagation delay is not an issue, a lower LastWrittenLsn results in higher overall performance, because it lets us exclude more layers from the search for WAL in the Pageserver.

Implementation ideas

Pairingheap (?) cache eviction

We could change our cache eviction policy to one based on the lowest LSN in the cache, e.g., through a pairing heap keyed on the lowest LSN. This way, we could have:

- overallLastWrittenLsn - the coarsest-granularity LSN, covering all blocks not tracked by the rest of the infrastructure
- recentlyEvicted - N entries associated with a BufferTag
- currentLsn - the maximum LSN we can expect in any block

with overallLastWrittenLsn < recentlyEvicted (N entries) < currentLsn.

This results in more precise LSN lookups for pages where it matters - those with a low LSN.

Note that we use this LSN stuff only to ensure that Pageserver is up-to-date enough, so it makes sense to keep granular data for only the most recently updated evicted pages.
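
As a rough illustration (not tied to the actual Neon code), here is a minimal, self-contained C sketch of min-LSN eviction. The names (`set_last_written_lsn`, `overallLastWrittenLsn`, `CacheEntry`) are hypothetical, an array-based binary min-heap stands in for a pairing heap, and a linear scan stands in for the per-BufferTag hash lookup:

```c
/* Minimal sketch of min-LSN cache eviction (hypothetical names; an
 * array-based binary min-heap stands in for a pairing heap, and a linear
 * scan stands in for the per-BufferTag hash lookup). */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE 4            /* N entries tracked with full precision */

typedef uint64_t XLogRecPtr;

typedef struct
{
    uint32_t   blkno;           /* stand-in for a full BufferTag */
    XLogRecPtr lsn;             /* last-written LSN of this block */
} CacheEntry;

static CacheEntry heap[CACHE_SIZE];
static int        heap_used = 0;
static XLogRecPtr overallLastWrittenLsn = 0;    /* fallback for evicted blocks */

static void
swap_entries(int a, int b)
{
    CacheEntry tmp = heap[a]; heap[a] = heap[b]; heap[b] = tmp;
}

static void
sift_down(int i)
{
    for (;;)
    {
        int l = 2 * i + 1, r = 2 * i + 2, m = i;

        if (l < heap_used && heap[l].lsn < heap[m].lsn) m = l;
        if (r < heap_used && heap[r].lsn < heap[m].lsn) m = r;
        if (m == i) return;
        swap_entries(i, m);
        i = m;
    }
}

/* Record that 'blkno' was last written at 'lsn' (LSNs only ever grow). */
static void
set_last_written_lsn(uint32_t blkno, XLogRecPtr lsn)
{
    for (int i = 0; i < heap_used; i++)
        if (heap[i].blkno == blkno)
        {
            heap[i].lsn = lsn;  /* key grew: push it down the heap */
            sift_down(i);
            return;
        }

    if (heap_used == CACHE_SIZE)
    {
        /* Evict the entry with the *lowest* LSN; folding it into the
         * fallback keeps overallLastWrittenLsn as low as possible. */
        if (heap[0].lsn > overallLastWrittenLsn)
            overallLastWrittenLsn = heap[0].lsn;
        heap[0] = heap[--heap_used];
        sift_down(0);
    }

    /* Insert the new entry and sift it up to its place. */
    int i = heap_used++;
    heap[i].blkno = blkno;
    heap[i].lsn = lsn;
    while (i > 0 && heap[(i - 1) / 2].lsn > heap[i].lsn)
    {
        swap_entries(i, (i - 1) / 2);
        i = (i - 1) / 2;
    }
}

/* Upper bound on the LSN at which 'blkno' was last written. */
static XLogRecPtr
get_last_written_lsn(uint32_t blkno)
{
    for (int i = 0; i < heap_used; i++)
        if (heap[i].blkno == blkno)
            return heap[i].lsn;
    return overallLastWrittenLsn;   /* not tracked precisely: fall back */
}

int
main(void)
{
    for (uint32_t b = 0; b < 8; b++)
        set_last_written_lsn(b, 1000 + b);      /* evicts blocks 0..3 */
    printf("fallback=%llu lookup(7)=%llu lookup(0)=%llu\n",
           (unsigned long long) overallLastWrittenLsn,
           (unsigned long long) get_last_written_lsn(7),
           (unsigned long long) get_last_written_lsn(0));
    return 0;
}
```

With this policy the fallback only ever absorbs the lowest tracked LSN, so lookups for blocks that fell out of the cache return the lowest LSN we can still justify, rather than whatever happened to be evicted last under LRU.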

Probabilistic lastWrittenLsn

We could update our LSN cache to one based on count-min sketches: we maintain N buckets, and for every evicted page we hash the page ID to M (M < N) buckets and update each of those buckets' lastWrittenLsn value to Max(currentVal, pageLsn). When looking up the lastWrittenLsn for a page, we take Min(associated buckets' lastWrittenLsn), which is an upper bound on (and, absent collisions with higher-LSN pages in all M buckets, equal to) the actual last written LSN for that page. The benefit is a better estimated last written LSN in less memory than a hash map over all pages (no hashmap-related maintenance infrastructure necessary), and it covers the whole database. I'll refer to "count-min sketches" and "counting Bloom filters" for more details on why this would work, though here we use "Max(insert_value, current_value)" instead of "current_value++" in the buckets.
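
A minimal C sketch of that layout (N buckets, M hash positions per page ID, Max on update, Min on lookup); all names and the hash mixing here are hypothetical, not Neon code:

```c
/* Sketch of a "max" variant of a count-min sketch / counting Bloom filter
 * for last-written LSNs. N buckets total; each page ID hashes to M of them.
 * Update raises those buckets to at least the page's LSN; lookup takes the
 * minimum over them, which is always >= the page's true last-written LSN. */
#include <stdint.h>
#include <stdio.h>

#define NBUCKETS 1024                /* N buckets */
#define NHASHES  3                   /* M hash positions per page, M < N */

typedef uint64_t XLogRecPtr;

/* In a real system these would start at the LSN at which tracking began,
 * so that a never-inserted page still gets a valid (conservative) answer. */
static XLogRecPtr buckets[NBUCKETS];

/* Cheap stand-in for M independent hash functions over a page ID. */
static uint32_t
page_hash(uint64_t pageid, int h)
{
    uint64_t x = pageid + (uint64_t) (h + 1) * 0x9E3779B97F4A7C15ULL;

    x ^= x >> 30;
    x *= 0xBF58476D1CE4E5B9ULL;
    x ^= x >> 27;
    return (uint32_t) (x % NBUCKETS);
}

/* Called when a page's last-written LSN is recorded (e.g., on eviction). */
static void
sketch_set_lsn(uint64_t pageid, XLogRecPtr lsn)
{
    for (int h = 0; h < NHASHES; h++)
    {
        uint32_t b = page_hash(pageid, h);

        if (buckets[b] < lsn)
            buckets[b] = lsn;        /* Max(currentVal, pageLsn) */
    }
}

/* Upper bound on the page's last-written LSN: Min over its buckets. */
static XLogRecPtr
sketch_get_lsn(uint64_t pageid)
{
    XLogRecPtr result = UINT64_MAX;

    for (int h = 0; h < NHASHES; h++)
    {
        uint32_t b = page_hash(pageid, h);

        if (buckets[b] < result)
            result = buckets[b];
    }
    return result;
}

int
main(void)
{
    sketch_set_lsn(42, 1000);
    sketch_set_lsn(43, 5000);
    /* 42's estimate can only exceed 1000 if every one of its M buckets
     * collided with a higher-LSN page. */
    printf("page 42 <= %llu, page 43 <= %llu\n",
           (unsigned long long) sketch_get_lsn(42),
           (unsigned long long) sketch_get_lsn(43));
    return 0;
}
```

Lookups can only over-estimate, never under-estimate, so the "is the Pageserver up-to-date enough" check stays safe; the analysis in a later comment quantifies how often the over-estimate is large.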

This could replace only XLogCtl->maxLastWrittenLsn, or both XLogCtl->maxLastWrittenLsn and the lastWrittenLsnCache. I don't have a strong preference for either, but precise data (see Pairingheap cache eviction above) could be useful as a first tier, after which this probabilistic structure could provide the next layer of LSNs, still more granular than a single global fallback.

Note that this probabilistic lastWrittenLsn map could also be used in the pageserver, to help determine the age of a block in the cluster.

Other notes

We should probably do something to pull the LSN tracking code into the extension.

MMeent commented 1 year ago

CC @knizhnik My thoughts on improving the overall performance of our lastWrittenLsn infrastructure.

This would lower the chance of an unnecessarily high lastWrittenLsn, and thus of having to wait for WAL.

MMeent commented 1 year ago

Some more analysis on the "Probabilistic lastWrittenLsn":

  1. Because a bucket's value is the largest of all blocks that hash to that bucket, the order of insertion doesn't matter for the value of the resulting data structure. Insertion order [A, B] will have the same result as [B, A].
  2. The value of a bucket is the LSN of the first block that hashes to that bucket when using ordered insertion of blocks by largest LSN.
  3. When inserting in order of largest LSN, the N-th bucket that receives a value therefore also contains the N-th largest value.

This means we can use normal hash collision statistics for the expected values of buckets: the n-th largest bucket has the value that corresponds to the p-th page's LSN when all pages are ordered by largest eviction LSN. The number of pages p we can insert before we expect to fill the n-th bucket out of N total buckets is found by inverting the birthday-paradox estimate, which gives $p \approx -N \ln(1 - \frac{n}{N})$ [^0].

Using this calculation, we can see that the number of pages' lastWrittenLsns we store in a given percentage of the buckets, $n = \text{percentage} \cdot N$, is linearly dependent on the number of buckets:

Substitute $n$ into the base equation:

$$p \approx -N \ln\left(1 - \frac{\text{percentage} \cdot N}{N}\right)$$

Since $\frac{N}{N} = 1$:

$$p \approx -N \ln(1 - \text{percentage})$$

Expand for some pre-calculated fractions:

| percentile of bucket | estimated number of pages inserted (with LSN >= bucket value) |
| -- | -- |
| 0.050% | 0.0005 * N buckets |
| 0.100% | 0.001 * N buckets |
| 0.250% | 0.0025 * N buckets |
| 0.500% | 0.005 * N buckets |
| 1.000% | 0.0101 * N buckets |
| 2.500% | 0.0253 * N buckets |
| 5.000% | 0.0513 * N buckets |
| 10.000% | 0.1054 * N buckets |
| 20.000% | 0.2231 * N buckets |
| 30.000% | 0.3567 * N buckets |
| 40.000% | 0.5108 * N buckets |
| 50.000% | 0.6931 * N buckets |
| 60.000% | 0.9163 * N buckets |
| 70.000% | 1.204 * N buckets |
| 80.000% | 1.6094 * N buckets |
| 90.000% | 2.3026 * N buckets |
| 95.000% | 2.9957 * N buckets |
| 97.500% | 3.6889 * N buckets |
| 99.000% | 4.6052 * N buckets |
| 99.500% | 5.2983 * N buckets |
| 99.750% | 5.9915 * N buckets |
| 99.900% | 6.9078 * N buckets |
| 99.950% | 7.6009 * N buckets |
| 99.975% | 8.294 * N buckets |
| 99.990% | 9.2103 * N buckets |
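
For reference, the table values can be reproduced with a few lines of C (just a sketch of the formula above; link with `-lm`):

```c
/* Reproduces the table above: expected number of pages inserted before the
 * bucket at a given percentile gets its value, p ~= -N * ln(1 - fraction),
 * expressed in units of N buckets. */
#include <math.h>
#include <stdio.h>

int
main(void)
{
    const double fractions[] = {
        0.0005, 0.001, 0.0025, 0.005, 0.01, 0.025, 0.05, 0.10, 0.20, 0.30,
        0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.975, 0.99, 0.995,
        0.9975, 0.999, 0.9995, 0.99975, 0.9999,
    };

    for (size_t i = 0; i < sizeof(fractions) / sizeof(fractions[0]); i++)
        printf("%7.3f%% -> %.4f * N buckets\n",
               fractions[i] * 100.0, -log(1.0 - fractions[i]));
    return 0;
}
```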

I've not yet gone through the calculations about what would happen if we had multiple hashes or utilized multiple data structures, but I think the best that would do is reduce variance, as we're still working with only N buckets.


From the topic post:

> Note that this probabilistic lastWrittenLsn map could also be used in the pageserver, to help determine the age of a block in the cluster.

Using the above heuristics, I don't think this is realistic anymore - we would want a map that is large enough so that we would realistically store each page's LwLsn with good precision, and this does not seem to provide that.

[^0]: ref: https://math.stackexchange.com/questions/659008/inverse-birthday-estimation

knizhnik commented 1 year ago

I think that, when talking about the last written LSN cache, we should not forget two things:

  1. Shared buffers. Recently accessed (and modified) pages tend to stay in shared buffers for some time. Yes, when a ring buffer is used this time can be quite short, but the probability that a recently updated page will be evicted is still quite small.
  2. Backpressure. Right now the lag between pageserver and compute is limited to 15 MB, which is just 2000 pages, while the default size of the last written LSN cache is 128k entries.

So I do not think that the current last written LSN cache can cause any slowdown of Neon, i.e., that get_page_at_lsn has to wait until irrelevant changes are applied.

The only exception is prefetch. Here, as we have seen, a last written LSN cache miss plus updates of the traversed pages (as in vacuum) can cause prefetch requests to be rejected. But in this case setting the last written LSN for prefetched pages can help. Yes, it will "pollute" the cache, but traversing 128k pages still takes a significant amount of time, giving the pageserver a good chance to process the 2k pages.

MMeent commented 1 year ago

> Backpressure. Right now the lag between pageserver and compute is limited to 15 MB, which is just 2000 pages.

I disagree with this calculation. A record that modifies a page needs only 44 bytes to do so (xlog header @ 24 B, plus a block registration at 4 B, plus db/nsp/rel/blkno at 16 B; the remaining record data is optional), resulting in ~312k pages being modified (when rounding record sizes up to 48 B).
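
Spelling out that arithmetic (this assumes 15 MB is read as $15 \cdot 10^6$ bytes, which is how the ~312k figure comes out):

$$24\,\mathrm{B}\ \text{(xlog header)} + 4\,\mathrm{B}\ \text{(block registration)} + 16\,\mathrm{B}\ \text{(db/nsp/rel/blkno)} = 44\,\mathrm{B},\ \text{rounded up to } 48\,\mathrm{B}$$

$$\frac{15 \cdot 10^{6}\,\mathrm{B}}{48\,\mathrm{B}\ \text{per record}} \approx 312{,}500\ \text{modified pages}$$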

> And the default size of the last written LSN cache is 128k entries, so I do not think that the current last written LSN cache can cause any slowdown of Neon.

I disagree here as well - the cache is currently LRU, so the last evicted value is added and the least recently evicted value is removed. If that distance is small enough and the WAL volume low enough, that will result in waits.

And it's not only get_page_at_lsn in Pageserver that needs to wait when false eviction happens; we also need the LwLsn cache to ensure no page was modified and evicted after prefetching it (your prefetch eviction issue from last weekend).

For that, we need a good and accurate cache whose behavior we can reason about. Re-inserting the prefetched pages does work, but adds additional (locking) overhead that we shouldn't need. In addition, the write-in-cache approach would fail at large enough prefetch distances with enough concurrent connections by thrashing the cache: e.g., with 64 pages of readahead, each backend touches 64 evictions plus 64 reads before hitting the page, so around 1024 concurrently active connections would be enough to cycle through a 128k-entry cache and thrash it. Not entirely realistic in our case right now, but not infeasible either.

knizhnik commented 1 year ago

> I disagree with this calculation.

Yes, you are certainly right that one WAL page may contain multiple records and so touch multiple pages. In the case of random updates (pgbench), the situation will be close to the one you described (not 44 bytes, but still on the order of a hundred dirty pages per WAL page). But I mostly considered a COPY+vacuum workload; in that case mostly FPIs are used, and the number of modified pages is roughly proportional to the size of the WAL.

> I disagree here as well - the cache is currently LRU, so the last evicted value is added and the least recently evicted value is removed.

Yes, the eviction policy is LRU. But first of all, "the least recently evicted value is removed" is not true: a page is moved to the head of the LRU list on each access (each update of its last written LSN), and the last written LSN is set when the page is written out by the SMGR, which does not mean that it has been evicted. I thought about maintaining a true "last evicted LSN" rather than "last written", but it requires a non-trivial patch in bufmgr and didn't give any advantage on any workload I tried.

It seems pointless to dispute here whether a 128k-entry cache is enough to avoid pageserver waits or not. I haven't seen use cases where pageserver wait time (which can be observed via Prometheus metrics) becomes the limiting factor for performance. That is what lets me say that I do not think we have a problem here. If you disagree, please point me to an example which demonstrates it.

> we also need the LwLsn cache to ensure no page was modified and evicted after prefetching it

Yes, and here we really did face the problem during VACUUM (and with the default shared_buffers=1MB). But I hope that the PR setting the last written LSN for prefetched requests solves the problem. My first thought was that we really need a more "accurate" cache here, but then it seemed to me that a much simpler solution would be enough.

Please notice that the problem takes place only with vacuum/bulk updates. The scenario with short WAL records and multiple pages updated per WAL page is the result of a random OLTP workload, but in that case prefetch is not performed at all, and the bottleneck will most likely be the compute-pageserver roundtrip (if the working set doesn't fit in the compute cache), which serves as a kind of backpressure and prevents a large LSN gap between compute and pageserver.

> For that, we need a good and accurate cache whose behavior we can reason about. Re-inserting the prefetched pages does work, but adds additional (locking) overhead that we shouldn't need. In addition, the write-in-cache approach would fail at large enough prefetch distances with enough concurrent connections by thrashing the cache.

You disagree with my arguments, and I disagree (partly) with yours... I think that before trying to implement a more accurate last written LSN cache, we should first create a benchmark which demonstrates the problem. It is not good to optimize something based on speculation alone. Prefetch is in any case applicable mostly to OLAP queries, and an OLAP workload rarely requires a large number of parallel queries. So once again, your scenario of many concurrent connections running queries with a large prefetch distance may never happen in real life. Even if we have parallel vacuum, which processes multiple tables concurrently, there will still be just a few active backends.