neondatabase / neon


future layer deletion can race with re-creation #5878

Closed: arpad-m closed this issue 11 months ago

arpad-m commented 11 months ago

See this thread on Slack.

  1. a future image layer is created and uploaded
  2. the pageserver restarts
  3. the future layer from (1) is deleted during load_layer_map
  4. the image layer is re-created and uploaded
  5. the deletion queue wants to delete (1) but actually deletes (4)
    • delete-by-name works as expected, but it now deletes the wrong (later) version

The deletion can be delayed and might only happen after the re-creation has uploaded the layer file. The file is then gone from remote storage, even though it is still referenced.
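
As a toy illustration of why the delayed delete-by-name hits the wrong version (this is not Neon code, just a sketch of the sequence above with remote storage modeled as a name-keyed dict and no object versions):

```python
# Toy illustration (not Neon code): remote storage is keyed by layer file
# name only and object versions are not tracked, so a delayed delete-by-name
# removes whatever object currently lives under that name.
remote_storage = {}   # layer file name -> bytes
index_part = set()    # layer file names referenced by the index
pending_deletes = []

layer = "some-image-layer-file-name"  # placeholder name

# 1. the future image layer is created and uploaded
remote_storage[layer] = b"image layer, first version"
index_part.add(layer)

# 2./3. pageserver restarts; load_layer_map schedules a delete of the future layer
pending_deletes.append(layer)

# 4. the layer is re-created and uploaded before that delete executes
remote_storage[layer] = b"image layer, re-created version"
index_part.add(layer)

# 5. the delayed delete runs: delete-by-name now removes the re-created object
for name in pending_deletes:
    remote_storage.pop(name, None)

# The index still references the layer, but the file is gone from remote storage.
assert layer in index_part and layer not in remote_storage
```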

Another thread with a more illustrative screenshot: https://neondb.slack.com/archives/C0660LJT22J/p1700230109306539

See the comment below for what's going on.

cc @koivunej

koivunej commented 11 months ago

Uploads need to make sure there is no pending deletion for the same layer name. An easy solution would be to delete a specific object version, but we do not track object versions, so I can't really see a quick fix. A test case would be easy to write (a rough sketch follows the list):

  1. create tenant + timeline
  2. force image layer creation
  3. add some data via the endpoint
  4. detach
  5. attach
    • now the future image layer has a scheduled delete
  6. force image layer creation
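
A rough sketch of that test, assuming the fixtures and pageserver HTTP client methods from the repo's Python test suite; the method names and the exact way to force image layer creation are assumptions here, not verified code:

```python
# Rough repro sketch; fixture/client method names are assumed from neon's
# Python test suite, and forcing image layer creation may additionally need
# tenant config tweaks (e.g. a low image_creation_threshold).
from fixtures.neon_fixtures import NeonEnvBuilder


def test_future_layer_delete_races_with_recreation(neon_env_builder: NeonEnvBuilder):
    # 1. create tenant + timeline
    env = neon_env_builder.init_start()
    client = env.pageserver.http_client()
    tenant_id, timeline_id = env.initial_tenant, env.initial_timeline

    # 2. force image layer creation (flush + compact; assumes thresholds allow it)
    client.timeline_checkpoint(tenant_id, timeline_id)
    client.timeline_compact(tenant_id, timeline_id)

    # 3. add some data via the endpoint
    endpoint = env.endpoints.create_start("main")
    endpoint.safe_psql("CREATE TABLE t AS SELECT generate_series(1, 100000) AS g")

    # 4. detach, 5. attach: load_layer_map now schedules deletion of the future layer
    client.tenant_detach(tenant_id)
    client.tenant_attach(tenant_id)

    # 6. force image layer creation again: the pending delete can now race
    #    with the re-upload of a layer with the same name
    client.timeline_checkpoint(tenant_id, timeline_id)
    client.timeline_compact(tenant_id, timeline_id)
```
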
problame commented 11 months ago

Summarizing today's Slack discussion:

The symptom is a layer file that is referenced from the index part but is actually missing in remote storage.

The intermediate cause is what @arpad-m pointed out in the initial comment, i.e., the DELETE from load_layer_map and the PUT of the re-creating compaction get incorrectly reordered.

The root cause of that reordering is not a bug in the deletion queue, nor in the persistence of the deletion queue. One might get that impression from the initial Slack thread, but it's wrong: we don't have generation numbers enabled in prod yet, so the deletion is neither deferred nor is the queue persistent across process restarts.

The real bug is one level higher, namely in how remote_timeline_client / upload_queue works: the upload queue allows DELETEs and PUTs to run concurrently, and it is the caller's job to insert a barrier if they operate on the same keys and a certain order is required.

So, the solution here is to insert a barrier between the deletion and the re-creation, so that the deletion is guaranteed to happen before the re-created layer is uploaded.

Here's an illustration of the upload queue that, depending on execution order, can lead to the symptom:

[image: illustration of the upload queue]
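
To make the barrier idea concrete, here is a minimal, self-contained toy model of a barrier-aware queue; it is not the actual remote_timeline_client API, just an illustration of the ordering guarantee a barrier provides:

```python
# Toy model of an upload queue with barriers (not the actual
# remote_timeline_client API): operations within a barrier-delimited batch
# may execute in any order, but a barrier waits for everything scheduled
# before it, so nothing is reordered across it.
from itertools import permutations


def outcomes(initial, ops):
    """All final 'remote storage' states reachable by freely reordering ops
    within each barrier-delimited batch (never across a barrier)."""
    batches, current = [], []
    for op in ops:
        if op == ("barrier",):
            batches.append(current)
            current = []
        else:
            current.append(op)
    batches.append(current)

    states = {frozenset(initial)}
    for batch in batches:
        next_states = set()
        for state in states:
            for order in permutations(batch):
                s = set(state)
                for kind, name in order:
                    if kind == "put":
                        s.add(name)
                    else:  # "delete"
                        s.discard(name)
                next_states.add(frozenset(s))
        states = next_states
    return states


layer = "future-image-layer"  # placeholder layer file name

# The future layer from before the restart is already in remote storage.
# Without a barrier, the scheduled DELETE and the re-creating PUT can run in
# either order, so the layer may end up missing:
print(outcomes({layer}, [("delete", layer), ("put", layer)]))
# -> {frozenset({'future-image-layer'}), frozenset()}

# With a barrier between them, the DELETE always completes first and the
# re-created layer survives:
print(outcomes({layer}, [("delete", layer), ("barrier",), ("put", layer)]))
# -> {frozenset({'future-image-layer'})}
```

The same guarantee could also come from deleting specific object versions, but as noted above we do not track those, so ordering the deletion and the re-upload within the queue is the simpler fix.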