neondatabase / neon


future layer deletion can race with re-creation #5878

Closed: arpad-m closed this issue 11 months ago

arpad-m commented 11 months ago

See this thread on Slack.

  1. a future image layer is created and uploaded
  2. the pageserver restarts
  3. the future layer from (1) is deleted during load_layer_map
  4. the image layer is re-created and uploaded
  5. the deletion queue wants to delete (1) but actually deletes (4)
    • delete-by-name works as expected, but it now deletes the wrong (later) version

The deletion can be delayed and might only happen after the re-creation has uploaded the layer file. The file is then gone from remote storage, even though it is still referenced.
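
As a toy illustration of why the delayed delete-by-name hits the wrong version (this is not Neon code, just a sketch of the sequence above with remote storage modeled as a name-keyed dict and no object versions):

```python
# Toy illustration (not Neon code): remote storage is keyed by layer file
# name only and object versions are not tracked, so a delayed delete-by-name
# removes whatever object currently lives under that name.
remote_storage = {}   # layer file name -> bytes
index_part = set()    # layer file names referenced by the index
pending_deletes = []

layer = "some-image-layer-file-name"  # placeholder name

# 1. the future image layer is created and uploaded
remote_storage[layer] = b"image layer, first version"
index_part.add(layer)

# 2./3. pageserver restarts; load_layer_map schedules a delete of the future layer
pending_deletes.append(layer)

# 4. the layer is re-created and uploaded before that delete executes
remote_storage[layer] = b"image layer, re-created version"
index_part.add(layer)

# 5. the delayed delete runs: delete-by-name now removes the re-created object
for name in pending_deletes:
    remote_storage.pop(name, None)

# The index still references the layer, but the file is gone from remote storage.
assert layer in index_part and layer not in remote_storage
```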

Another thread with a more illustrative screenshot: https://neondb.slack.com/archives/C0660LJT22J/p1700230109306539

See the comment below for what's going on.

cc @koivunej

koivunej commented 11 months ago

Uploads need to make sure there is no pending deletion for the same layer name. An easy solution would be to delete a specific object version, but we do not track object versions, so I can't really see a quick fix. A test case would be easy to write (a rough sketch follows the list):

  1. create tenant + timeline
  2. force image layer creation
  3. add some data via the endpoint
  4. detach
  5. attach
    • now the future image layer has a scheduled delete
  6. force image layer creation
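
A rough sketch of that test, assuming the fixtures and pageserver HTTP client methods from the repo's Python test suite; the method names and the exact way to force image layer creation are assumptions here, not verified code:

```python
# Rough repro sketch; fixture/client method names are assumed from neon's
# Python test suite, and forcing image layer creation may additionally need
# tenant config tweaks (e.g. a low image_creation_threshold).
from fixtures.neon_fixtures import NeonEnvBuilder


def test_future_layer_delete_races_with_recreation(neon_env_builder: NeonEnvBuilder):
    # 1. create tenant + timeline
    env = neon_env_builder.init_start()
    client = env.pageserver.http_client()
    tenant_id, timeline_id = env.initial_tenant, env.initial_timeline

    # 2. force image layer creation (flush + compact; assumes thresholds allow it)
    client.timeline_checkpoint(tenant_id, timeline_id)
    client.timeline_compact(tenant_id, timeline_id)

    # 3. add some data via the endpoint
    endpoint = env.endpoints.create_start("main")
    endpoint.safe_psql("CREATE TABLE t AS SELECT generate_series(1, 100000) AS g")

    # 4. detach, 5. attach: load_layer_map now schedules deletion of the future layer
    client.tenant_detach(tenant_id)
    client.tenant_attach(tenant_id)

    # 6. force image layer creation again: the pending delete can now race
    #    with the re-upload of a layer with the same name
    client.timeline_checkpoint(tenant_id, timeline_id)
    client.timeline_compact(tenant_id, timeline_id)
```
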
problame commented 11 months ago

Summarizing today's Slack discussion:

The symptom is a layer file that is referenced from the index part but is actually missing in remote storage.

The intermediate cause is what @arpad-m pointed out in the initial comment, i.e., the DELETE from load_layer_map and the PUT of the re-creating compaction get incorrectly reordered.

The root cause of that reordering is not a bug in the deletion queue, nor in the persistence of the deletion queue. One might get that impression from the initial Slack thread, but it's wrong: we don't have generation numbers enabled in prod yet, so the deletion is neither deferred nor is the queue persistent across process restarts.

The real bug is one level higher, namely in how remote_timeline_client / upload_queue works: the upload queue allows DELETEs and PUTs to run concurrently, and it is the caller's job to insert a barrier if they operate on the same keys and a certain order is required.

So, the solution here is to insert a barrier between the deletion and the re-creation, so that the deletion is guaranteed to happen before the re-created layer is uploaded.

Here's an illustration of the upload queue that, depending on execution order, can lead to the symptom:

[image: illustration of the upload queue]
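
To make the barrier idea concrete, here is a minimal, self-contained toy model of a barrier-aware queue; it is not the actual remote_timeline_client API, just an illustration of the ordering guarantee a barrier provides:

```python
# Toy model of an upload queue with barriers (not the actual
# remote_timeline_client API): operations within a barrier-delimited batch
# may execute in any order, but a barrier waits for everything scheduled
# before it, so nothing is reordered across it.
from itertools import permutations


def outcomes(initial, ops):
    """All final 'remote storage' states reachable by freely reordering ops
    within each barrier-delimited batch (never across a barrier)."""
    batches, current = [], []
    for op in ops:
        if op == ("barrier",):
            batches.append(current)
            current = []
        else:
            current.append(op)
    batches.append(current)

    states = {frozenset(initial)}
    for batch in batches:
        next_states = set()
        for state in states:
            for order in permutations(batch):
                s = set(state)
                for kind, name in order:
                    if kind == "put":
                        s.add(name)
                    else:  # "delete"
                        s.discard(name)
                next_states.add(frozenset(s))
        states = next_states
    return states


layer = "future-image-layer"  # placeholder layer file name

# The future layer from before the restart is already in remote storage.
# Without a barrier, the scheduled DELETE and the re-creating PUT can run in
# either order, so the layer may end up missing:
print(outcomes({layer}, [("delete", layer), ("put", layer)]))
# -> {frozenset({'future-image-layer'}), frozenset()}

# With a barrier between them, the DELETE always completes first and the
# re-created layer survives:
print(outcomes({layer}, [("delete", layer), ("barrier",), ("put", layer)]))
# -> {frozenset({'future-image-layer'})}
```

The same guarantee could also come from deleting specific object versions, but as noted above we do not track those, so ordering the deletion and the re-upload within the queue is the simpler fix.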