
Epic: search and analyze storage worst-case scenarios #1076

Closed kelvich closed 2 years ago

kelvich commented 2 years ago

https://github.com/zenithdb/zenith/issues/906 showed that we don't have any immediate problems with big databases. Since increasing the database size does not automatically reveal any innate architectural problems, let's create such cases manually.

One scenario I can think of is the following:

  1. We have a small one-gigabyte table t1 and a big one-terabyte table t1000.
  2. The test generates 1 GB of updates to t1 and 100k random updates to t1000 (assuming ~100-byte WAL records, that is roughly ~10 MB of WAL). The 100k random updates to t1000 should make almost every 10 MB layer/segment dirty, and the updates to t1 should trigger a checkpoint in the pageserver.
  3. Wait a bit until the checkpoint is done.
  4. Go to 2.

If I understand the current pageserver algorithm correctly, that scenario would write ~1 TB of data on each iteration, caused by ~1 GB of Postgres data changes.
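As a rough back-of-the-envelope check of that claim (the numbers below are just the ones assumed in the scenario, not measurements):

# Sketch of the arithmetic behind "~1 TB written per ~1 GB of changes".
SEGMENT_SIZE = 10 * 1024 * 1024                 # ~10 MB layer/segment
BIG_TABLE_SIZE = 1024 ** 4                      # 1 TB table t1000
N_SEGMENTS = BIG_TABLE_SIZE // SEGMENT_SIZE     # ~100k segments

RANDOM_UPDATES = 100_000                        # random updates to t1000 per iteration
WAL_PER_UPDATE = 100                            # ~100-byte WAL records
wal_generated = RANDOM_UPDATES * WAL_PER_UPDATE + 1024 ** 3   # ~10 MB to t1000 plus ~1 GB to t1

# With ~100k random updates spread over ~100k segments, almost every segment
# gets at least one dirty page, so the checkpoint rewrites a ~10 MB image layer per segment.
disk_written = N_SEGMENTS * SEGMENT_SIZE        # ~1 TB per iteration
print(disk_written / wal_generated)             # roughly 1000x write amplification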

That is a somewhat artificial worst case, and we need quantitative metrics that tell us whether any real database load is "good" or "bad". So we need run-time metrics in the pageserver to measure write amplification and disk bloat. In the spirit of the RUM conjecture (http://daslab.seas.harvard.edu/rum-conjecture/) it makes sense to track at least the following in the pageserver:
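A minimal sketch of what such metrics could boil down to; the counter names here are hypothetical, not the pageserver's actual metrics:

class StorageMetrics:
    """Illustrative counters only; the real pageserver exposes its own metrics."""
    def __init__(self):
        self.wal_bytes_received = 0     # logical changes streamed in
        self.layer_bytes_written = 0    # delta + image layer bytes written to disk
        self.layer_bytes_on_disk = 0    # current total size of layer files
        self.logical_db_size = 0        # size of the latest database image

    def write_amplification(self) -> float:
        # bytes physically written per byte of WAL received
        return self.layer_bytes_written / max(self.wal_bytes_received, 1)

    def disk_bloat(self) -> float:
        # on-disk footprint relative to the logical database size
        return self.layer_bytes_on_disk / max(self.logical_db_size, 1)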

I propose the following plan of actions:

bojanserafimov commented 2 years ago

There's a test that does something similar.

Both of these tests seem to target the fact that we write too many image layers where delta layers would have been fine? Seems fixable, but let's talk tomorrow and see if I correctly understand the layered repo (and how it becomes a bottleneck). I don't quite understand why we need a mix of dense + sparse writes. Seems like if you can fill RAM with sparse writes you can trigger an eviction with equally bad (amortized) write amplification.

knizhnik commented 2 years ago

Finally I was able to reproduce this worst-case scenario. The problem with my previous attempts was that... our checkpointer is too slow. That is, no matter what we do, we generate WAL faster than the checkpointer is able to write it out. So to get the largest write amplification we need to:

  1. perform a minimal number of random updates that affect all segments of the big table
  2. write ~256 MB of data to some small table to reach the default checkpoint distance
  3. initiate a checkpoint and wait for its completion

This is achieved by the following test:

import random
import time
from contextlib import closing

import psycopg2.extras

# ZenithEnv, zenith_simple_env and print_gc_result come from the repo's test fixtures.
def test_write_amplification(zenith_simple_env: ZenithEnv):
    env = zenith_simple_env
    env.zenith_cli(["branch", "test_write_amplification", "empty"])
    pg = env.postgres.create_start('test_write_amplification')
    n_iterations = 100
    seg_size = 10 * 1024 * 1024
    n_segments = 10 * 1024 # 100Gb
    payload_size = 100
    header_size = 28
    tuple_size = header_size + payload_size
    big_table_size = n_segments * (seg_size // tuple_size)
    checkpoint_distance = 256 * 1024 * 1024
    small_table_size = checkpoint_distance // tuple_size
    redundancy = 3

    print(f"Big table size: {big_table_size}")
    print(f"Small table size: {small_table_size}")
    print(f"Iterations: {n_iterations}")
    print(f"Random updates per checkpoint: {n_segments * redundancy}")

    with closing(pg.connect()) as conn:
        with conn.cursor() as cur:
            with closing(env.pageserver.connect()) as psconn:
                with psconn.cursor(cursor_factory=psycopg2.extras.DictCursor) as pscur:

                    # Get the timeline ID of our branch. We need it for the 'do_gc' command
                    cur.execute("SHOW zenith.zenith_timeline")
                    timeline = cur.fetchone()[0]

                    # Create a test table
                    cur.execute(f"CREATE TABLE Big(pk integer primary key, count integer default 0, payload text default repeat(' ', {payload_size}))")
                    # populate ~100Gb in Big
                    cur.execute(f"INSERT INTO Big (pk) values (generate_series(1,{big_table_size}))")
                    cur.execute(f"CREATE TABLE Small(pk integer, payload text default repeat(' ', {payload_size}))")
                    for i in range(n_iterations):
                        print(f"Iteration {i}")
                        # update random keys
                        for _ in range(n_segments * redundancy):
                            key = random.randint(1, big_table_size)
                            cur.execute(f"update Big set count=count+1 where pk={key}")

                        # bulk insert small table to trigger checkpoint distance
                        cur.execute(f"insert into Small (pk) values (generate_series(1, {small_table_size}))")
                        # And truncate it to avoid database size explosion
                        cur.execute("truncate Small")
                        # Run GC to force checkpoint and save image layers
                        pscur.execute(f"do_gc {env.initial_tenant} {timeline} 0")
                        row = pscur.fetchone()
                        print_gc_result(row)
    # Wait forever to be able to inspect statistic
    print('Test completed')
    time.sleep(100000)

Each iteration takes about 400 seconds, produces ~256 MB of WAL and writes ~130 GB to disk. So write amplification is ~500x.
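For reference, the ~500x figure follows directly from those two numbers:

wal_per_iteration = 256 * 1024 ** 2       # ~256 MB of WAL
disk_per_iteration = 130 * 1024 ** 3      # ~130 GB written to disk
print(disk_per_iteration / wal_per_iteration)   # ~520, i.e. roughly 500x write amplification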

knizhnik commented 2 years ago

Also, loading a 130 GB table with a primary key takes 130 minutes (1 GB/min). That is much worse than with vanilla Postgres. When loading data into a table without indexes the speed was almost the same: 7 hours for 1.3 TB, which corresponds to 3 GB/min.

bojanserafimov commented 2 years ago

After running this test, the timeline was 259MB, but 252MB of those were image files.

A simple solution is to make sure that each image file is preceded by consecutive delta files that are as big as the image in total. Tweak some logic here. This way disk bloat would be exactly 2, and read amplification would be worst case 2, but average 1.5 (because the average get_page@lsn request consumes half the deltas between two images).

In the test example above, instead of 7MB of deltas and 252MB of images, we'd have 7MB of each, 14MB in total. The test numbers will be different on bigger data (I can try @knizhnik 's test), but we'd still have disk bloat of 2 and read amplification worst case 2 and average 1.5.

Though I'm not sure if we want to exactly balance disk bloat and read amplification, because I don't know how cheap wal redo is. But we can tune a parameter k: if we set a target total_image_size = k * total_delta_size then we have:

P.S: Most of the get_page@lsn queries where latency matters should be about the most recent version of the page. So if we make sure that the last layer on disk for each segment is an image layer, then we'd effectively have read amplification = 1. We can achieve this by always writing an image layer when evicting, but deleting the second-most-recent image if it falls too close to the most recent one. We're not adding any disk bloat with this strategy, but write amplification can get bad.
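A sketch of how such a k-balanced trigger could look (a hypothetical helper, not the actual checkpointer code):

def should_write_image(deltas_since_last_image: int, image_size: int, k: float = 1.0) -> bool:
    """Write a new image layer only once the delta bytes accumulated since the
    previous image reach image_size / k, which keeps total_image_size close to
    k * total_delta_size. With k = 1 this gives disk bloat ~2 and worst-case
    read amplification ~2 (average ~1.5)."""
    return deltas_since_last_image >= image_size / k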

knizhnik commented 2 years ago

"A simple solution is to make sure that each image file is preceded by consecutive delta files that are as big as the image in total."

Unfortunately, this simple solution will not work. First of all, we have to periodically perform some kind of backup, and backups IMHO should contain page images, just for simplicity and speed of apply. So even if only one page of a 10 MB segment is updated, we still have to produce a new image layer.

The second argument is that one page (or even one record) of a segment can be updated multiple times. Applying 10 MB of WAL to reconstruct such a page will take too much time. Since the only way to permanently store a reconstructed page right now is to create a new image layer, we will have to create this layer to keep reconstruction time reasonably small. Yes, we now have a page cache and ephemeral files, but both are lost on pageserver restart, so we can only rely on delta and image layers.

But I agree that the only way to avoid write amplification is to delay creation of new image layers. If a checkpoint is performed every second, it is definitely not a good idea to create a new image layer each second, especially if there are only a few changes in the segment. So the layer eviction/reconstruction policy should be smarter. Right now an image layer is not created by the checkpoint performed at shutdown (to keep shutdown time reasonably small). I think we should add a more flexible rule for creating new image layers (a rough sketch follows the list below). It may be triggered by:

  1. The total number of deltas (WAL records) for the segment.
  2. The maximal length of the update chain for a page.
  3. The time since the last image layer was created.
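For illustration, the three triggers could be combined roughly like this (a sketch only; names and thresholds are made up, not pageserver settings):

import time

MAX_DELTA_RECORDS = 10_000      # 1. total number of WAL records for the segment
MAX_PAGE_CHAIN_LEN = 100        # 2. maximal update-chain length for any page
MAX_IMAGE_AGE_SECS = 3600       # 3. time since the last image layer was created

def should_create_image_layer(delta_records: int,
                              longest_page_chain: int,
                              last_image_created_at: float) -> bool:
    return (delta_records >= MAX_DELTA_RECORDS
            or longest_page_chain >= MAX_PAGE_CHAIN_LEN
            or time.time() - last_image_created_at >= MAX_IMAGE_AGE_SECS)
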
knizhnik commented 2 years ago

I have tried to implement a smarter image layer generation policy and started with the simplest one: do not create an image layer if the size of the delta layer is less than 1/2 of the segment size (10 MB). It certainly dramatically increases the speed of this test: a checkpoint takes a few seconds instead of 400.

But at the same time it causes a lot of "Timed out while waiting for WAL record" errors, which start to happen when the duration of a checkpoint exceeds 1 minute.

Eliminating such timeouts is one of the primary purposes of the back pressure mechanism. But for some reason it does not consider this lag at all, only the flush (disk_consistent_lsn) and apply (S3) LSN lags. I have created PR https://github.com/zenithdb/postgres/pull/115 introducing max_replication_write_lag, but it doesn't help. I have only one explanation for it: without writing image layers, layer eviction is performed so fast that other threads (performing get_page_at_lsn) fail to grab the LayerMap and LayeredTimeline mutexes before they are reacquired by the checkpointer. So for the whole one-minute checkpointer iteration, get_page_at_lsn is blocked and the timeout expires.

The simplest solution is to replace the std mutex with a parking_lot mutex, as we have already done in some of our components. It provides more "fair" resource acquisition.

Alternatively, we can build the list of layers to be evicted under the mutex and then process this list without holding locks.

knizhnik commented 2 years ago

Yet another problem with delaying the write of image layers: right now GC is able to remove old layers only if there is an image layer. So, with a 1-second checkpoint interval and 3 hours of work, ... deleting the tenants directory (just rm -fr) takes more than one minute...

bojanserafimov commented 2 years ago

Nice results!

Question on your "Applying 10Mb of WAL to reconstruct this page will take too much time." comment: Do we ever need to reconstruct just a single page? AFAIK single-page requests should be coming mostly from the compute leader, so they should be at latest LSN, which we have an image for. Requests for old LSNs will come from PITR, but we might as well reconstruct entire segments for that workload. Applying 10MB of WAL to get an entire segment doesn't seem bad. The third case is maybe requests from read-only replicas, but let me know if we've already had conversations on that topic before I start guessing how we might run them.

Another solution is maybe a "partial image" layer, which holds images for just a few pages of the segment (along with a header describing which pages it holds). The difference between a delta and a partial image is that when you're reading the busy page of a segment, you only need to find the latest partial image, not all of them since the last full image. This effectively simulates temporarily having segment_size = page_size when it makes sense. I assume in general we don't want segment_size = page_size because of filesystem metadata overhead?
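A toy sketch of the lookup such a partial image layer would enable (the data layout and names are hypothetical):

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PartialImageLayer:
    lsn: int
    pages: Dict[int, bytes] = field(default_factory=dict)  # header: page number -> page image

def latest_image_for_page(layers: List[PartialImageLayer], page_no: int) -> Optional[bytes]:
    """Scan layers from newest to oldest; the first partial (or full) image that
    contains the page is enough, with no need to collect all deltas back to the
    last full image."""
    for layer in sorted(layers, key=lambda l: l.lsn, reverse=True):
        if page_no in layer.pages:
            return layer.pages[page_no]
    return None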

knizhnik commented 2 years ago

"Do we ever need to reconstruct just a single page?"

Why do we need to reconstruct a single page? At least because the SMGR API deals with pages and the compute node requests pages from the pageserver. That certainly does not mean that we cannot reconstruct the whole segment at once, but what benefit would we get from it? The problem is that applying 10 MB of WAL records is a long delay for get_page_at_lsn execution and, as a result, a long query execution time. The fact that we reconstructed not a single page but the whole segment will not help to reduce this delay. Moreover, with the current implementation of page image reconstruction (using a WAL redo Postgres instance that receives requests through a pipe), there will be no large difference between executing N requests to materialize one page each and one request to materialize N pages.

So from my point of view, eager materialization of the whole segment will not help us reduce page access time and may even have a negative impact on performance (if segment pages are updated non-uniformly).

Now concerning different workloads. There is the OLAP workload, where pages are accessed mostly sequentially and where some kind of bulk fetch (requesting the whole segment instead of individual pages) may be very useful. And there is the OLTP workload, where we most frequently request a random B-tree page and do not need its neighbors. And yes, the compute leader requests pages at the latest LSN, but such a page can actually be very old (updated long ago).

Actually, I do not completely understand your proposal to reconstruct not a single page but the whole segment, because... that is exactly what we currently do when evicting an in-memory layer: create a delta layer and an image layer. And exactly because of this we get write amplification when page access is skewed. I am trying to find some criteria we can use to postpone image layer creation, but as I mentioned above, there seem to be a lot of problems.

bojanserafimov commented 2 years ago

Misunderstanding: By "reconstruct" I meant "compute page by redoing wal, not reading from image". Sorry if I'm overloading an existing term.

The pageserver technically needs to serve all GetPage@LSN requests, even sparse workloads at LSN != latest. But seems like no service is actually making these requests. We're either getting: a) requests at latest LSN, or b) requests with high locality (pages in large contiguous blocks)

For (a), performance is good because we're not doing any wal redo. We have images. Even if the page is very old, if it's the latest one we have an image on disk.

For (b), we compute entire segments. If that's what we're doing now, then no change necessary. We're redoing 8kb of wal per page amortized. This should only happen during PITR. If we need to further optimize it, there are ways.

I'll get to your storage concerns later. Also I haven't fully specified how to achieve k=1 and also "image for all segments on disk" at the same time. I'll get to that later. Now I'm focused on the following claim: "we can accept bad performance for a non-local GetPage@LSN workload at a non-latest LSN, because this workload doesn't exist." Let me know if this is not the case. I'm new to the team, I don't know everything :)

knizhnik commented 2 years ago

"Misunderstanding: By "reconstruct" I meant "compute page by redoing wal, not reading from image". Sorry if I'm overloading an existing term."

Sorry if I was unclear, but by "reconstruct" I also mean applying WAL records to create the final page image.

"For (a), performance is good because we're not doing any wal redo. We have images."

We do not get images at the pageserver (except FPIs). We receive WAL and have to apply it at the pageserver to reconstruct the page. The question is how frequently we should do it... We can materialize the page immediately after receiving each WAL record, but in this case we have to overwrite the page many times. Or we can delay materialization, hoping that the just-updated page will not be requested by the compute node soon. Ideally we should collect a large enough set of WAL records to be applied, but reconstruct the page in the background (by the checkpointer thread) before it is requested by the compute node. In that case there will be no get_page_at_lsn delay.
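A sketch of that idea (hypothetical names and thresholds; the real checkpointer works on layers, not on a per-page dict):

MATERIALIZE_AFTER_N_RECORDS = 32    # made-up threshold

wal_per_page = {}       # page number -> pending WAL records
page_images = {}        # page number -> last materialized image

def ingest_wal(page_no, record):
    wal_per_page.setdefault(page_no, []).append(record)

def background_materialize(apply_wal):
    # Called periodically from a checkpointer-like background thread, so that
    # hot pages are already materialized when the compute node asks for them.
    for page_no, records in list(wal_per_page.items()):
        if len(records) >= MATERIALIZE_AFTER_N_RECORDS:
            page_images[page_no] = apply_wal(page_images.get(page_no), records)
            wal_per_page[page_no] = []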

knizhnik commented 2 years ago

My main concern now is that with max_replication_flush_lag=0 the above test never completes successfully (it is terminated by "page server returned error: Timed out while waiting for WAL record" after some iterations), even though I have specified max_replication_write_lag=1MB. This back pressure tuning seems to be very obscure...

knizhnik commented 2 years ago

"So write amplification is ~500 times."

The final write amplification after finishing all 100 iterations of the test is much smaller: "just" 50 times:

table size: 100 GB
generated WAL: 280 GB
disk write IO: 14 TB

I think it is because of random updates and autovacuum, which periodically update the whole table.

knizhnik commented 2 years ago

This trace illustrates the problem with a large number of files. A checkpoint is performed every second, but a new image layer is created only if the total size of deltas exceeds half of the segment size (10 MB). This causes the creation of a large number of small delta layer files:

[ec2-user@ip-172-31-24-242 ~]$ time ls -l /data/zenith/test_output/test_write_amplification/repo/tenants/187e2ae40c5d494b86918838419ac904/timelines/f1080e663a20916c897d6ff2f26acfda/ | wc -l
1784267

real    0m36.851s
user    0m25.106s
sys 0m12.033s

According to pageserver.log, GC takes 7 seconds just to traverse all files:

2022-01-09 16:49:31.81  INFO [utils.py:75] GC duration 7415 ms
2022-01-09 16:49:31.82  INFO [utils.py:78]   REL    total: 1784297, needed_by_cutoff 323, needed_by_branches: 0, not_updated: 1783832, needed_as_tombstone 101, removed: 11, dropped: 30
knizhnik commented 2 years ago

Loading data into the same table in vanilla Postgres takes only 21 minutes (vs. 130 for Zenith). So here the difference between vanilla and Zenith is 6.5 times. As I wrote above, this difference is most likely explained by the presence of an index, the building of which requires reading the whole table.

bojanserafimov commented 2 years ago

@kelvich I'd like to close the "Think of any changes" subtask since I think the LSM solution with a few modifications is the right way to go. My final thoughts (after many conversations with @heikki):

Let's start with the LSM tree solution (as proposed here https://github.com/zenithdb/zenith/issues/936). The worst case scenario for that solution is a 1TB wal that's writing to one single page. Answering GetPage@LSN for that page would require replaying 1TB of wal, which is not good. Read amplification for GetPage@LSN is unbounded.

So let's patch this solution by inserting an 8KB page every 8KB/k of wal, for each page individually. This increases write amplification by an additional factor of k, but places an upper bound on read amplification.

We can play with the numbers but let's analyze the case with fanout = 4 and k = 1. With fanout 4, 1TB is stored in 6 layers (because 1 + log_4(1TB / 1GB) = 6).

Disk bloat is k + 1 = 2.

Write amplification is (1 + k) * num_layers = 12. This seems big but as long as it's below SSD_speed/network_speed we should be good. That number is 20 AFAIK so we have some wiggle room.

To analyze read amplification, let's look at 2 cases: a) Overhead during recovery from s3 b) Overhead for GetPage@LSN requests

During s3 startup/recovery we need to download num_layers * fanout * 1GB = 24GB to get our first page. If the network is the bottleneck that's about 1 sec to get the first page and 40 sec to restore the full 1TB dataset (without sharding).

For GetPage@LSN overhead we have 1 + 1/k = 2 read amplification for wal redo, and some extra overhead to find the relevant entries from the LSN. We need to look inside num_layers * fanout = 24 files. It's not bad, but with some indexing this can be reduced even further down to 1 for the common case.
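Plugging in those numbers (a sketch that just restates the arithmetic above, assuming images are compacted along with deltas at each level):

import math

total_size_gb = 1024    # 1 TB of data
top_level_gb = 1        # size of the first level
fanout = 4
k = 1                   # one 8KB image per 8KB/k of wal

num_layers = 1 + round(math.log(total_size_gb / top_level_gb, fanout))  # = 6
disk_bloat = k + 1                                                      # = 2
write_amp = (1 + k) * num_layers                                        # = 12
first_page_download_gb = num_layers * fanout * top_level_gb             # = 24 GB from s3
files_per_getpage = num_layers * fanout                                 # = 24 files to look inside
redo_read_amp = 1 + 1 / k                                               # = 2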

Stepping back, if we compare this system against vanilla replicated postgres, here are the differences in performance I'd expect:

  1. I'm not worried about handling write throughput because the network will be the bottleneck in our case, and in the replicated vanilla postgres case as well.
  2. We can afford to disable checkpointing on the compute node so we might gain performance over vanilla postgres on some workloads.
  3. I'm not worried about s3 restore. If you store wal + checkpoints of vanilla postgres on s3 you'll have to download the same amount of data. We might perform better because we can partially restore what's needed first.
  4. I think the only thing (theoretically) left to optimize is read performance of GetPage@LSN. A vanilla postgres would just read from disk, but we have to take a network roundtrip and maybe some wal redo. The wal redo part might be possible to eliminate in 99% of the cases (when LSN=latest) but the network roundtrip is unavoidable. We can avoid a network roundtrip with some on-disk caching on the compute node, but there will always be a workload that misses that cache, like a large table scan. So maybe it's worth adding a perf test for this.
knizhnik commented 2 years ago

Sorry, maybe we should organize a meeting about the proposed improvement of the current layered approach, because frankly speaking I do not completely understand it. First of all, 1 TB of WAL updating a single page was never the worst case. The checkpointer will periodically (every 256 MB, if it is fast enough) produce full page images, so the pageserver never needs to apply 1 TB of WAL during get_page_at_lsn. Also, Postgres itself (the compute node) helps to prevent such a scenario: periodically it performs its own checkpoints (not to be confused with checkpoints performed by the pageserver) and produces FPIs.

But I still do not understand how you are going to eliminate write amplification and the large number of produced files for the scenario where just one page of a segment is updated once. In that case we have an in-memory layer containing just one WAL record. We cannot keep this layer open forever: the pageserver needs to provide data durability and advance the disk consistent LSN. Right now we will create a delta layer (very small in this case) and an image layer (10 MB). We can postpone creation of the image layer, but that does not prevent the problem of a large number of small delta layer files, and GC will not be able to collect them until an image layer is constructed.

So maybe it will be more efficient and convenient to discuss this by voice...

bojanserafimov commented 2 years ago

Very rough summary of our meeting: The biggest discussion point was that the LSM solution would be doing non-sequential disk reads during GetPage@LSN, which would result in (5x?) slower performance on cache miss than we have now. See Figure 3 here for reference on SSD random read performance https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf

kelvich commented 2 years ago

"@kelvich I'd like to close the "Think of any changes" subtask since I think the LSM solution with a few modifications is the right way to go. My final thoughts (after many conversations with @heikki):"

Yeah, +1 on stopping activity here. I think with https://github.com/zenithdb/zenith/issues/1076 and https://github.com/zenithdb/zenith/issues/1116 getting merged we should close that issue. And any other bloat/perf discussion can be continued in https://github.com/zenithdb/zenith/issues/936