neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Epic: page reconstruction (materialization) criteria #1481

Open knizhnik opened 2 years ago

knizhnik commented 2 years ago

Right now the only criteria for image layer generation are the number of delta layers and the compaction timeout. With the current default settings, it is necessary to write about 8GB of WAL before an image layer is created. As a result, if we load data into the database and then perform mostly read-only queries, the image layer may never be created and we have to reconstruct a page each time it is accessed, which is about 5 times slower than reading it from an image layer.

In PR #1468 I tried to force creation of image layers based on min/max page reconstruction timeouts. The idea was that if a page has been updated, then after some timeout expires we should reconstruct it, so that subsequent get_page_at_lsn requests do not have to reconstruct it, significantly reducing page access latency.

The problem with this approach is write amplification: together with the target page we have to write 128MB of other pages, which may not have changed at all. We may try to find some other storage for the reconstructed page than an image layer. We already have an in-memory page cache; we could also extend it with on-disk storage, or include the reconstructed page image in a delta layer during recompaction or in-memory layer eviction (the last option is not desired because it may increase replication_flush_lag).

Ideally we should reconstruct only those pages which will be requested by the compute node. How can we predict this? The simplest solution is to collect information about pages evicted from shared buffers and send it to the page server. Certainly, naively sending a separate message for each page may be extremely inefficient, but we can group several such notifications and send them together with some other smgr request (e.g. get_page_at_lsn).

If a page is evicted from the compute node's shared buffers, then it was not accessed for some time, so most likely it will not be updated in the near future, and we can reconstruct it with small risk of doing wasteful work.

Concerning a possible implementation: we can extend the smgr API by adding an smgr_evict function, or we can use only the smgr_write function. Granted, a call of smgr_write doesn't mean that the page will be evicted from shared buffers; it can just be flushed by the bgwriter. But even such information from the compute node is better than nothing.
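
To make this concrete, here is a minimal compute-side sketch of what such a hook could look like (the smgrevict signature, the EvictedPage struct, and the batching constant are hypothetical, just for illustration; the notifications are only hints, so losing some is fine):

    #include "postgres.h"
    #include "access/xlogdefs.h"
    #include "storage/smgr.h"

    /* Hypothetical hook: remember pages evicted from shared buffers and
     * piggyback the whole batch on the next regular smgr request to the
     * page server, instead of sending one message per evicted page. */

    #define MAX_PENDING_EVICTIONS 128

    typedef struct EvictedPage
    {
        RelFileNode node;       /* relation of the evicted page */
        ForkNumber  forknum;
        BlockNumber blocknum;
        XLogRecPtr  lsn;        /* page LSN at eviction time */
    } EvictedPage;

    static EvictedPage pending[MAX_PENDING_EVICTIONS];
    static int         npending = 0;

    /* Called from BufferAlloc() when a valid victim buffer is chosen. */
    void
    smgrevict(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
              XLogRecPtr lsn)
    {
        if (npending < MAX_PENDING_EVICTIONS)
        {
            pending[npending].node = reln->smgr_rnode.node;
            pending[npending].forknum = forknum;
            pending[npending].blocknum = blocknum;
            pending[npending].lsn = lsn;
            npending++;
        }
        /* The batch is flushed together with the next get_page_at_lsn
         * request; if the buffer overflows we simply drop entries,
         * since these notifications are only hints. */
    }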

Thoughts?

bojanserafimov commented 2 years ago

+1 on the existence of the problem. Here are some tests that show at least a part of the problem.

+1 on the general idea of taking ephemeral page images independent of image layers. There's no (or very little) reason to upload these page images to cloud storage, and the ephemeral nature allows us flexibility in the storage format (append-only file vs contiguous mapping, etc)

About your proposed heuristic: If we have a read replica, then probably every page is evicted from either the primary or the replica. So the heuristic of materializing only evicted pages is useful initially, but it only gets us so far. Even without replicas, it's mostly useful for small databases, where the majority of pages are not evicted.

Also, there's no (amortized) cpu cost of eagerly doing wal redo for pages that won't be read by compute. They will eventually need to be reconstructed in order to create an image layer anyway. The only cost to eager materialization is write amplification and bloat.

Another heuristic would be to materialize the pages with the most updates; this way we minimize the page reconstruction work that needs to be done later. This is not perfect either; I haven't arrived at a strong conclusion.

Side note (follows from my RFC): we should only be thinking about materializing the latest cold version of each page. No need to materialize pages in the middle of a delta layer, only at the end. (I think you agree, but it's worth mentioning.)

knizhnik commented 2 years ago

If we have a read replica, then probably every page is evicted from either the primary or the replica.

First of all, I do not want to discuss read replicas right now; let's first concentrate on the master compute node. The necessity of read replicas in the Zenith architecture, and the way of maintaining such replicas, is still unclear to me.

But still, your statement that "every page is evicted from either the primary or the replica" is unclear to me. Eviction of a page from shared buffers depends on the access pattern, which can be completely different for the master and the replica (hot standby). In any case, the fact that a page was evicted from shared buffers just means that the compute node (master or replica) doesn't have it any more, and once it is requested it has to be downloaded from the pageserver. So there is a good argument to reconstruct it. In the case of the master we can also guarantee that nobody will update this page before requesting it, so its reconstructed copy will always be useful.

Also, there's no (amortized) cpu cost of eagerly doing wal redo for pages that won't be read by compute.

If the compute node continues to update this page, then eager page reconstruction is just a waste of CPU: these "intermediate" page versions will never be needed. Assume that we have a series of updates of some page: U1, U2, U3, ... Un. If we reconstruct the page after each update, then we have to perform N reads of the page, N writes of the page, and apply N WAL records. If we delay page reconstruction, then we need just one read, one write, and N applies. It is expected to consume ~N times less CPU than the first case.

The only cost to eager materialization is write amplification and bloat.

Not only that: also CPU time, memory, and memory bandwidth.

Another heuristic would be to materialize the pages with most updates.

Not such a good criterion, because if a page is frequently updated, then most likely it is cached in shared buffers and so is never requested from the page server.

No need to materialize pages in the middle of a delta layer,

The size of a delta layer is 128MB (with the intention to increase it to 1GB), and filling such a layer may take some time. Assume that we perform random updates and that the size of an update record is 50 bytes. Then a 128MB delta layer corresponds to updates of roughly 2 million pages, i.e. about 16GB of data, which is much more than 128MB of shared buffers. So the compute node will have to evict pages and request them from the page server, and delaying image layer generation till the end of a delta layer may not be possible.
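
Spelling out the arithmetic (approximate, under the same assumptions):

    \frac{128\ \mathrm{MB}}{50\ \mathrm{B}} \approx 2.6 \times 10^{6}\ \text{records}
    \;\Rightarrow\; \sim 2 \times 10^{6}\ \text{distinct pages} \times 8\ \mathrm{KB}
    \approx 16\ \mathrm{GB} \gg 128\ \mathrm{MB}\ \text{of shared buffers}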

bojanserafimov commented 2 years ago

Good points.

If the compute node continues to update this page, then eager page reconstruction is just a waste of CPU: these "intermediate" page versions will never be needed. Assume that we have a series of updates of some page: U1, U2, U3, ... Un. If we reconstruct the page after each update, then we have to perform N reads of the page, N writes of the page, and apply N WAL records. If we delay page reconstruction, then we need just one read, one write, and N applies. It is expected to consume ~N times less CPU than the first case.

  1. Yes, I agree there's read/write cost for storing/loading the intermediate results. But there's no repeated redo work. Whether redo is done in multiple batches or one, the total number of wal entries redone is the same.
  2. N is at most 3. The most aggressive eager materialization is to materialize all modified pages at the end of the delta layer. After 3 delta layers we take an image layer. I'm not proposing we materialize all modified pages, btw.
  3. Also consider that eager materialization might decrease read/write cost. We're reading/writing 3x more data, but all reads are single 8KB page reads, which on SSDs are about as fast as sequential reads. Without any eager materialization, we're reading scattered PageReconstructData from SSD, in chunks that can be as small as 50 bytes. I'm not sure how big this effect will be, or whether it makes up for the read/write bloat.

Not such a good criterion, because if a page is frequently updated, then most likely it is cached in shared buffers and so is never requested from the page server.

You're probably right.

The size of a delta layer is 128MB (with the intention to increase it to 1GB), and filling such a layer may take some time. Assume that we perform random updates and that the size of an update record is 50 bytes. Then a 128MB delta layer corresponds to updates of roughly 2 million pages, i.e. about 16GB of data, which is much more than 128MB of shared buffers. So the compute node will have to evict pages and request them from the page server, and delaying image layer generation till the end of a delta layer may not be possible.

We are dealing with two separate problems:

  1. Avoid page reconstruction from SSD
  2. Avoid page reconstruction from RAM

My intuition says that the first problem is much larger. Not sure yet. It's a good question to answer with tests. Or at least it would be productive to discuss the two problems separately.

For the second problem (avoiding reconstruction from RAM), as you said, it might make sense to materialize some intermediate results, not only the latest state. The simplest heuristic I can think of is to take a page image once we have 24KB of WAL for the same page in memory. This is similar to how we take an image layer after 3 delta layers, but much more granular. I'm sure there might be a better heuristic though.
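
A minimal sketch of that heuristic (standalone C pseudocode for illustration only; the names and structure are made up, and the actual pageserver is written in Rust):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical per-page bookkeeping inside the in-memory layer. */
    typedef struct PageWalStats
    {
        size_t pending_wal_bytes;   /* WAL bytes buffered for this page
                                     * since its last materialized image */
    } PageWalStats;

    /* Illustrative threshold: take a page image once ~24KB of WAL has
     * accumulated for one page -- "image layer after 3 delta layers",
     * but at per-page granularity. */
    #define MATERIALIZE_WAL_THRESHOLD (24 * 1024)

    /* Called whenever a WAL record for this page is buffered; returns
     * true if the page should be materialized now. */
    static bool
    should_materialize(PageWalStats *stats, size_t record_len)
    {
        stats->pending_wal_bytes += record_len;
        if (stats->pending_wal_bytes >= MATERIALIZE_WAL_THRESHOLD)
        {
            stats->pending_wal_bytes = 0;   /* reset after taking the image */
            return true;
        }
        return false;
    }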

knizhnik commented 2 years ago

But there's no repeated redo work.

Sure. But my measurements show that the time to apply one WAL record is almost the same as for a batch of 100 WAL records. It is the price of interprocess communication through the pipe.
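
In other words, a rough cost model (t_ipc and t_apply are just my notation for one pipe round trip and one record application):

    N \cdot (t_{\mathrm{ipc}} + t_{\mathrm{apply}}) \quad \text{(one record per round trip)}
    \qquad \text{vs.} \qquad
    t_{\mathrm{ipc}} + N \cdot t_{\mathrm{apply}} \quad \text{(one batch of } N \text{ records)}

With t_ipc much larger than t_apply, the batched case is close to N times cheaper per record.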

N is at most 3.

Sorry, I do not understand this. Yes, there are at most 3 delta layers when we trigger generation of an image layer. But each delta layer can contain an arbitrary number of changes (WAL records) for a particular page.

Also consider that eager materialization might decrease read/write cost.

Here I disagree. First of all, 8KB is too small a unit. If you are reading/writing random pages, then you get random I/O speed, which is several times lower than sequential I/O speed even for SSDs (for an HDD with its ~10ms average seek time it is a showstopper). To eliminate the price of random I/O, you need to use chunks of several megabytes.

The extreme form of eager page reconstruction is ... standard Postgres redo: apply each WAL record immediately. There are two main performance-limiting factors here:

  1. Sequential apply of WAL: at the master, changes are made by several concurrent transactions, but at the replica we have a single WAL receiver process which has to apply these WAL records sequentially.
  2. Random page access: to apply changes, we first need to load the target page. If random pages are updated, then once again performance is limited by random access speed. This is why there are several WAL prefaulter extensions for Postgres (one of them is in the commitfest) which try to prefetch the affected pages.

In Zenith we can eliminate this problem by scattering WAL by key and applying it in batch mode in the background. Please notice that get_page_reconstruct_data in most cases reads data from just one sequential segment of a delta layer. This is one more argument to delay materialization: we are not only sending a batch of records through the pipe to the WAL redo process, but also reading them as a batch using a single read operation from the delta layer.

My intuition says that the first problem is much larger.

My vision of the problem is different; I have explained it in the previous answer. Page reconstruction should be as lazy as possible, to minimize overhead at all processing phases:

  • reading the target page
  • reading the WAL records to be applied
  • sending the original image + WAL records through the pipe to the WAL redo process
  • storing the materialized page

The more records we can apply in one batch, the more efficient WAL redo is. But all this should be done BEFORE the page is actually requested. Unfortunately, we do not know for sure which pages will be requested by the compute node; we can only try to predict it, using machine learning or whatever else.

The simplest strategy is to reconstruct pages evicted from shared buffers, hoping that at some point they will be requested by the compute node.

bojanserafimov commented 2 years ago

Sure. But my measurements show that the time to apply one WAL record is almost the same as for a batch of 100 WAL records. It is the price of interprocess communication through the pipe.

You're assuming the records are in RAM. It grows with the number of records.

Sorry, I do not understand this. Yes, there are at most 3 delta layers when we trigger generation of an image layer. But each delta layer can contain an arbitrary number of changes (WAL records) for a particular page.

All consecutive changes in the delta layer can be applied in one batch.

First of all, 8KB is too small a unit. If you are reading/writing random pages, then you get random I/O speed, which is several times lower than sequential I/O speed even for SSDs (for an HDD with its ~10ms average seek time it is a showstopper). To eliminate the price of random I/O, you need to use chunks of several megabytes.

See figure 3. We can see if the same is true for the machines we use.

The simplest strategy is to reconstruct pages evicted from shared buffers

Actually you're right, it's a pretty good and simple strategy for the in-mem layer. The in-mem layer should contain few evicted pages, so we won't end up materializing too much.

Beyond in-mem, I've argued above that all the pages you'll end up materializing will save a lot of effort, but we will end up materializing the entire database. What percentage of a 1TB database do you expect to be in shared buffers?

To be clear, I'm not opposed to that: it adds some write amplification, but it's worth it. However, you weren't receptive at all to my suggestion to do this a while ago, so I assumed you were not OK with this amount of write amplification.

knizhnik commented 2 years ago

I tried to implement propagation of information about evicted pages from the compute node to the page server, in order to force reconstruction of these pages. I have added an smgrevict entry to the SMGR API, which is called by BufferAlloc when it tries to find a victim buffer:

        /* In BufferAlloc(): if the victim buffer held a valid page, notify
         * the page server that this page has been evicted. */
        if (oldFlags & BM_TAG_VALID)
        {
            smgrevict(smgr,
                      buf->tag.forkNum,
                      buf->tag.blockNum,
                      lsn);
        }

First surprise: when I moved the call of SetLastWrittenPageLSN from zenith_wallog_page (which is called from zenith_write and zenith_extend) to zenith_evict, I got a lot of WAL-redo errors ("updated item pointer doesn't exist"). This has to be investigated, because we actually need the last-evicted rather than the last-written LSN.

Evicted pages are simply looked up on the page server by a separate thread. They are expected to be pushed into the page cache (whose size was extended to 2GB, so that all reconstructed pages should fit in memory).

Unfortunately, the performance improvement was not as large as I expected. With 10 clients, pgbench at scale 100 shows almost the same results: 3493 vs. 3423 TPS. It looks like a larger number of clients hides page reconstruction latency. But for one client the difference is more noticeable: 292 vs. 254 TPS. Not so impressive, indeed.

Maybe the reason is that a page cache hit currently doesn't prevent the lookup for deltas in layers. This is because the LSN of the cached image is usually smaller than the request LSN (updates of other pages happen after this page was cached). I am now trying to add an invalidation mechanism to the materialized page cache, so that we can use cached results without further checks.
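
The check I have in mind is roughly the following (standalone sketch only; the names are hypothetical and the real page cache is in Rust):

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t Lsn;               /* stand-in for an LSN */

    /* Hypothetical entry of the materialized page cache. */
    typedef struct CachedPage
    {
        Lsn   image_lsn;                /* LSN at which the image was built */
        bool  valid;                    /* cleared when new WAL for this page
                                         * arrives (the invalidation) */
        char  data[8192];
    } CachedPage;

    /* Without invalidation we could only trust the cached image when
     * image_lsn >= request_lsn, which is rarely the case, so we still had
     * to search the delta layers.  With invalidation, any still-valid image
     * at or below the request LSN can be returned directly. */
    static bool
    can_serve_from_cache(const CachedPage *cp, Lsn request_lsn)
    {
        return cp->valid && cp->image_lsn <= request_lsn;
    }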

shanyp commented 1 year ago

@knizhnik is there any follow-up needed on this one?