neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

pageserver: do image layer creation after timeline creation (or remove the code) #7197

Open jcsp opened 7 months ago

jcsp commented 7 months ago

Background

See: https://github.com/neondatabase/neon/pull/7182#issuecomment-2012100802

In flush_frozen_layer we do this:

        // As a special case, when we have just imported an image into the repository,
        // instead of writing out a L0 delta layer, we directly write out image layer
        // files instead. This is possible as long as *all* the data imported into the
        // repository have the same LSN.
        let lsn_range = frozen_layer.get_lsn_range();
        let (layers_to_upload, delta_layer_to_add) =
            if lsn_range.start == self.initdb_lsn && lsn_range.end == Lsn(self.initdb_lsn.0 + 1) {

This code path isn't taken for normal timeline creations: although we call freeze_and_flush right after creation, a small amount of WAL is ingested between ingesting initdb and freezing the layer, so the frozen layer's LSN range extends past initdb_lsn + 1 and the check above fails.
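For illustration, a minimal, self-contained sketch of why the check fails after a normal timeline creation (the local `Lsn` stand-in and the concrete values are hypothetical, not taken from a real run):

    // Stand-in for utils::lsn::Lsn, just enough for this illustration.
    #[derive(Clone, Copy, PartialEq, Debug)]
    struct Lsn(u64);

    fn main() {
        let initdb_lsn = Lsn(0x0169_60E8); // hypothetical
        // A small amount of WAL is ingested after initdb and before the freeze,
        // so the frozen layer's range ends somewhere past initdb_lsn + 1.
        let lsn_range = initdb_lsn..Lsn(0x0169_6200); // hypothetical end

        let takes_image_layer_path =
            lsn_range.start == initdb_lsn && lsn_range.end == Lsn(initdb_lsn.0 + 1);
        // The exact-range check fails, so we fall through to the normal L0 delta path.
        assert!(!takes_image_layer_path);
    }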

It's mostly harmless to skip this image layer generation, because an L1 layer full of page values is not any less efficient than an image layer full of values. However, if we implement compression of image layers (#5913) before we attempt compression of image values in delta layers, there's a benefit to writing an image layer for newly created tenants, to reduce the physical size.

Action

We should do one of these two things:

  1. Make it so that we take this image layer generation path after normal timeline creations (a rough sketch follows this list). This will require updating some tests, especially those that configure a tiny layer count and then make assertions about layer counts.
  2. Or, just remove this dead code, and plan on implementing compression of image values in delta layers, such that the benefit of writing an image layer is almost nil.
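
As a non-authoritative sketch of option 1 (the helper name, the slack constant, and the values below are hypothetical; the real change would live in flush_frozen_layer and would need care about exactly what gets ingested between initdb and the freeze):

    #[derive(Clone, Copy, PartialEq, Debug)]
    struct Lsn(u64); // stand-in for utils::lsn::Lsn

    /// Hypothetical relaxed check: accept a frozen layer that starts at initdb_lsn
    /// and ends within a small slack after it, instead of requiring exactly
    /// initdb_lsn + 1 as the current code does.
    fn should_write_image_layers(lsn_range: &std::ops::Range<Lsn>, initdb_lsn: Lsn) -> bool {
        const POST_INITDB_SLACK: u64 = 0x1000; // hypothetical allowance for the small WAL ingest
        lsn_range.start == initdb_lsn && lsn_range.end.0 <= initdb_lsn.0 + POST_INITDB_SLACK
    }

    fn main() {
        let initdb_lsn = Lsn(0x0169_60E8);
        let frozen_range = initdb_lsn..Lsn(0x0169_6200); // today's exact check rejects this range
        assert!(should_write_image_layers(&frozen_range, initdb_lsn));
    }

Whether a slack like this is acceptable, versus flagging the initdb import explicitly, is part of the judgment call this issue asks for.
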
koivunej commented 5 months ago

Encountered an s3 recovery related problem in #7927: if we just rely on "flush more often" to solve this issue (as happens when checkpoint_distance is smaller than the initdb size), we will produce two index_part.json updates very close to one another. This means that s3 recovery will not work, and the test case hangs waiting for the WAL part of initdb to arrive for the root timeline.

This failure mode was obscured by a number of things, but mock_s3 and real_s3 both exhibit this behaviour together with stable sort.

Of course this only applies to timelines which have never had a compute started up against them. However, the first uploaded index_part.json version is meaningless and inconsistent: we can never recover to that LSN using safekeepers, because the pageserver is the only one that ever had the WAL (uploaded as initdb.tar.zst).

For importing really large backups, I don't think we can use the normal flush loop at all; we will need to build the image layers directly somehow. I don't know how to do it in a streaming fashion, because we'd essentially need random-access I/O over the whole fullbackup tar to do the repartitioning and splitting into image layers. An okay workaround might be to create arbitrary image layers before the imported LSN so that we can fit the fullbackup and produce "L0 deltas" (which are actually image layers, but this way they get the compaction treatment).
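
To make the streaming constraint concrete, here is a small self-contained sketch (assuming the tar crate; the file names and contents are made up) of why a fullbackup tar only supports sequential access: entries arrive in archive order and must be consumed in turn, so repartitioning by key range into image layers needs either buffering everything or random-access reads.

    use std::io::{Cursor, Read};

    fn main() -> std::io::Result<()> {
        // Build a tiny in-memory tar standing in for a fullbackup archive.
        let mut builder = tar::Builder::new(Vec::new());
        for (name, data) in [("base/1/16384", &b"page A"[..]), ("base/1/16385", &b"page B"[..])] {
            let mut header = tar::Header::new_gnu();
            header.set_size(data.len() as u64);
            header.set_mode(0o644);
            header.set_cksum();
            builder.append_data(&mut header, name, data)?;
        }
        let bytes = builder.into_inner()?;

        // Reading back: entries come strictly in archive order, and each entry's
        // contents must be consumed before advancing. Splitting this stream into
        // key-partitioned image layers would mean holding (or re-reading) the data,
        // which is the random-access problem mentioned above.
        let mut archive = tar::Archive::new(Cursor::new(bytes));
        for entry in archive.entries()? {
            let mut entry = entry?;
            let path = entry.path()?.into_owned();
            let mut contents = Vec::new();
            entry.read_to_end(&mut contents)?;
            println!("{}: {} bytes", path.display(), contents.len());
        }
        Ok(())
    }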