neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars 429 forks source link

re-assess severity of duplicate layers: nowadays it cannot happen & we should panic/abort() early if they do #7790

Closed problame closed 4 months ago

problame commented 5 months ago

Background

From the time before always-authoritative index_part.json, we had to handle duplicate layers. See the RFC for an illustration of how duplicate layers could happen: https://github.com/neondatabase/neon/blob/a8e6d259cb49d1bf156dfc2215b92c04d1e8a08f/docs/rfcs/027-crash-consistent-layer-map-through-index-part.md?plain=1#L41-L50

As of #5198 , we should not be exposed to that problem anymore.

Problem 1

But, we still have

  1. code in Pageserver than handles duplicate layers
  2. tests in the test suite that demonstrates the problem using a failpoint

However, the test in the test suite doesn't use the failpoint to induce a crash that could legitimately happen in production. What is does instead is to return early with an Ok(), so that the code in Pageserver that handles duplicate layers (item 1) actually gets exercised.

That "return early" would be a bug in the routine if it happened in production. So, the tests in the test suite are tests for their own sake, but don't serve to actually regress-test any production behavior.

Problem 2

Further, if production code did (it nowawdays doesn't!) create a duplicate layer, I think the code in Pageserver that handles that condition (item 1 above) is too little too late:

Soution

Concern originally raised in https://github.com/neondatabase/neon/issues/7707#issuecomment-2112743857

koivunej commented 5 months ago

The analysis is based on the early exit which tests that compaction doesn't go into a loop where L0 compaction returns the same L0 in https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L15

I've since added a test case which actually tests the known duplication situation experienced with test_pageserver_chaos: https://github.com/neondatabase/neon/blob/d9dcbffac37ccd3331ec9adcd12fd20ce0ea31aa/test_runner/regress/test_duplicate_layers.py#L43

Your PR removes both test cases, but this issue only discusses the first one.

I do agree with problem 2 and the lateness. However the known duplicated situation with test_pageserver_chaos seemed to require a restart or has not been ran into otherwise (the runtime panic). Agreed we could read bad data since the actual switch on disk would had happened, but at least we would not have uploaded it to s3. Current setup was left to create noise (panic) in case we ever ran into this problem, and we haven't, before the tiered compaction work.

RENAME_NOREPLACE

I was thinking of link + unlink earlier on internal slack thread but yes, this would be simpler (I assume you did it via nix, link + unlink would have been via std). I am still however unconvinced we want to abort; I'd just add this hardening, keep the test demonstrating this behavior and fix the bug in tiered compaction. I guess I need to do a competing PR.