neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

tiered compaction: duplicated L1 layer error in test_deletion_queue_recovery #7707

Open arpad-m opened 5 months ago

arpad-m commented 5 months ago

Running the test_deletion_queue_recovery or test_uploads_and_deletions tests with tiered compaction enabled gives "duplicated L1 layer" errors:

2024-05-10T22:25:57.275644Z ERROR request{method=PUT path=/v1/tenant/b8c7c9b6739fed9060bfdf938ec9e9dc/timeline/5ffef4897699e4ff0fd68add218821ea/checkpoint request_id=227fdbbe-247c-4aee-8545-524487816dc4}:manual_checkpoint{tenant_id=b8c7c9b6739fed9060bfdf938ec9e9dc shard_id=0000 timeline_id=5ffef4897699e4ff0fd68add218821ea}: duplicated L1 layer layer=000000067F00000005000000000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001
2024-05-10T22:25:57.275660Z ERROR request{method=PUT path=/v1/tenant/b8c7c9b6739fed9060bfdf938ec9e9dc/timeline/5ffef4897699e4ff0fd68add218821ea/checkpoint request_id=227fdbbe-247c-4aee-8545-524487816dc4}:manual_checkpoint{tenant_id=b8c7c9b6739fed9060bfdf938ec9e9dc shard_id=0000 timeline_id=5ffef4897699e4ff0fd68add218821ea}: duplicated L1 layer layer=000000067F00000005000040000000000001-030000000000000000000000000000000002__0000000001535489-000000000154E229-00000001

The errors show up with the following change to test_deletion_queue_recovery:

-    env = neon_env_builder.init_start(initial_tenant_conf=TENANT_CONF)
+    tenant_conf = TENANT_CONF
+    tenant_conf["compaction_algorithm"] = '{{"kind": "Tiered"}}'
+    env = neon_env_builder.init_start(initial_tenant_conf=tenant_conf)
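
Note that `tenant_conf = TENANT_CONF` in the diff binds a second name to the same dict, so the assignment on the next line also changes the shared constant for any other test in the module. A minimal sketch of a variant that copies first is below. It assumes the doubled braces in the diff are meant to produce plain JSON; the `TENANT_CONF` contents and the helper name `with_tiered_compaction` are placeholders for illustration, not the real test code.

```python
import copy
import json

# Hypothetical stand-in for the test module's shared TENANT_CONF constant;
# the real keys and values may differ.
TENANT_CONF = {
    "gc_period": "0s",
    "compaction_period": "0s",
}


def with_tiered_compaction(base_conf: dict) -> dict:
    """Return a copy of base_conf that selects the tiered compaction algorithm."""
    # Deep-copy so the module-level constant is left untouched.
    conf = copy.deepcopy(base_conf)
    # Assumption: the setting is passed as a JSON string, as in the diff above.
    conf["compaction_algorithm"] = json.dumps({"kind": "Tiered"})
    return conf


tenant_conf = with_tiered_compaction(TENANT_CONF)
```

In the test, `tenant_conf` would then be passed as `initial_tenant_conf` exactly as in the diff.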

The test_deletion_queue_recovery test ran into all the important issues: previously, it ran into #7244 and #7296.

part of #7554

### Tasks
- [ ] https://github.com/neondatabase/neon/pull/7758
- [ ] fix the issue
- [ ] remove the `allowed_errors` added in #7758

jcsp commented 5 months ago

> The test_deletion_queue_recovery test ran into all the important issues: previously, it ran into https://github.com/neondatabase/neon/issues/7244 and https://github.com/neondatabase/neon/issues/7296.

Can we lift the subset of this test that reproduces these issues into a dedicated compaction test, perhaps as part of the PR fixing this issue?

arpad-m commented 5 months ago

> Can we lift the subset of this test that reproduces these issues into a dedicated compaction test

I could file a PR and then just allow the duplicated L1 layer errors.
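
As an illustration of what "allowing" the error means, here is a self-contained sketch of an allow-list filter over log lines. It is not the test fixture code from the repository: the function name `unexpected_errors`, the `" ERROR "` substring check, and the sample log lines are all made up for the example. Only the error text `duplicated L1 layer` comes from the logs in this issue.

```python
import re

# Regex patterns for error lines the test tolerates.
ALLOWED_ERRORS = [r".*duplicated L1 layer.*"]


def unexpected_errors(log_lines, allowed_patterns):
    """Return ERROR lines that match none of the allowed patterns."""
    compiled = [re.compile(pattern) for pattern in allowed_patterns]
    return [
        line
        for line in log_lines
        if " ERROR " in line and not any(p.search(line) for p in compiled)
    ]


# Made-up, shortened log lines for illustration only.
sample_log = [
    "2024-05-10T22:25:57Z ERROR manual_checkpoint: duplicated L1 layer layer=<layer name>",
    "2024-05-10T22:25:58Z ERROR manual_checkpoint: some other failure",
    "2024-05-10T22:25:59Z INFO compaction finished",
]

remaining = unexpected_errors(sample_log, ALLOWED_ERRORS)
```

With the allow-list in place, only the second sample line is reported. Once the underlying bug is fixed, dropping the pattern makes the check strict again, which is the last item in the task list.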

arpad-m commented 5 months ago

> I could file a PR and then just allow the duplicated L1 layer errors.

Done: #7758

problame commented 5 months ago

Meeting notes:

Action item: arpad & heikki to understand why it happens.