neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

pageserver: during compaction, write image layers if it will enable physical space reduction #6895

Open jcsp opened 6 months ago

jcsp commented 6 months ago

Background

The gc_feedback mechanism removed in https://github.com/neondatabase/neon/pull/6863 was meant to protect against edge cases where repeated keyspace repartitioning results in stacks of deltas that are never fully covered by image layers, and therefore never get GC'd.

The history as I understand it is:

Purpose

This ticket tracks creating an improved mechanism to ensure that:

  1. Long-idle timelines are proactively compacted into image layers to reduce storage space.
  2. Edge-case "gaps" in image layer coverage during compaction do not result in old delta layers being kept forever.
  3. Such proactive image layer generation does not result in non-root timelines copying large proportions of the parent timeline's data (i.e. CoW behavior is preserved).
  4. Proactive image layer generation does not closely track the GC horizon, to avoid continuously generating new image layers as the GC horizon advances. It also does not continuously generate image layers if someone sets the PITR interval to 0.

The previous gc_feedback mechanism was not widely used because it satisfied 1 & 2 but not 3 & 4.
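To make the goals above concrete, here is a rough sketch of the kind of check the issue title suggests: only write an image layer when the deltas it would make removable are bigger than the image itself. Everything in it (`DeltaInfo`, its two flags, `should_write_image`) is hypothetical naming for illustration, not pageserver code, and it assumes the caller already knows which deltas the new image would finish covering.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DeltaInfo:
    """Hypothetical summary of one delta layer the new image would finish covering."""
    size_bytes: int
    below_gc_cutoff: bool    # assumption: entirely older than the GC cutoff
    owned_by_timeline: bool  # assumption: False if the layer lives in an ancestor timeline


def should_write_image(image_size_bytes: int, deltas: Iterable[DeltaInfo]) -> bool:
    """Return True only if writing the image would free more bytes than it costs."""
    reclaimable = sum(
        d.size_bytes
        for d in deltas
        # Ancestor layers are never counted: covering them would copy parent
        # data into the child without freeing anything (goal 3).
        # Deltas still inside the retention window cannot be dropped yet,
        # so they are not counted either (goal 4).
        if d.owned_by_timeline and d.below_gc_cutoff
    )
    return reclaimable > image_size_bytes
```

With this rule, a branch whose old deltas all belong to its parent never triggers proactive image generation, and a timeline whose deltas are all still inside the retention window does not keep producing images as the horizon moves.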

A replacement mechanism might not need to involve the GC code -- we can directly query the layer map during compaction and:

problame commented 6 months ago

For posterity, Konstantin wrote a concise summary of the edge case that John mentions in the issue description (aka the "staircase pattern"):

https://neondb.slack.com/archives/C033RQ5SPDH/p1709210208901109?thread_ts=1708990928.754019&cid=C033RQ5SPDH

Sorry, it is not so easy for me to interpret this picture. But at first glance it seems to be the classical "stairs problem". Just a reminder of what the "stairs problem" means:

  • GC is able to remove a layer if it is fully covered by image layers.
  • An image layer is generated if there are at least 3 (or 6?) delta layers between it and the underlying image layer.
  • Boundaries of L1 layers are completely flexible: they depend only on physical layer size.

So it can happen that the start position of each newly generated L1 layer is shifted a little bit compared with the position of the previous L1 layer. This can happen naturally if we just append data to some table, so that the changed pages are at the end of the relation. Such a stair can have arbitrary height and never be fully covered by image layers. This is what my "gc-feedback" mechanism tries to address. But it was never tested on real projects and has now been removed (because it was not used).
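A toy model of the staircase may help make this concrete. It is only a sketch under simplifying assumptions: keys are plain integers, each layer is a single key range with one LSN, and the per-key count of stacked deltas stands in for the real image creation trigger. `Layer`, `is_fully_covered`, `staircase` and `stack_depth` are names made up for this example, not pageserver APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    key_start: int  # inclusive
    key_end: int    # exclusive
    lsn: int        # delta: its end LSN; image: the LSN the image was taken at


def is_fully_covered(delta, images):
    """True if images at or above delta.lsn cover the delta's whole key range."""
    pos = delta.key_start
    newer = sorted((i for i in images if i.lsn >= delta.lsn),
                   key=lambda i: i.key_start)
    for img in newer:
        if img.key_start > pos:
            break  # gap in image coverage: the delta must be kept
        pos = max(pos, img.key_end)
    return pos >= delta.key_end


def staircase(n, width, step):
    """n delta layers of equal width, each shifted by `step` keys from the last."""
    return [Layer(i * step, i * step + width, lsn=10 * (i + 1)) for i in range(n)]


def stack_depth(deltas, key):
    """Number of delta layers whose key range contains `key`."""
    return sum(d.key_start <= key < d.key_end for d in deltas)


deltas = staircase(n=20, width=100, step=60)
max_depth = max(stack_depth(deltas, k) for k in range(19 * 60 + 100))

# A single image over the first 100 keys, newer than every delta.
images = [Layer(0, 100, lsn=1000)]
stuck = [d for d in deltas if not is_fully_covered(d, images)]

print("max stack depth:", max_depth)
print("deltas that cannot be GC'd:", len(stuck), "of", len(deltas))
```

In this run no key ever has more than 2 deltas stacked on it, so a threshold of 3 is never reached no matter how many steps are added, and the one image that does exist frees only the first delta: the other 19 each stick out past its right edge.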

jcsp commented 5 months ago

Once we have image layer compression, we might decide that we want to unconditionally replace deltas with image layers on some time cadence (e.g. PITR interval) in order to benefit from compression. That might simplify this ticket.

arpad-m commented 5 months ago

I'm not sure if replacing all delta layers is the best idea as we want to preserve the CoW property of branching, but of course we can't hold onto delta layers unconditionally outside of the PITR interval.