
pageserver: accounting error on secondary tenant resident size #9628

Open VladLazar opened 2 weeks ago

VladLazar commented 2 weeks ago

This testing assertion fired during test_storage_controller_many_tenants on PR https://github.com/neondatabase/neon/pull/8613 (allure). There's a Slack thread with a bit of context here.

I spent some time looking at it, but didn't spot the issue. It happened shortly after a live migration, and since the "right" value is exactly double the "left" value, the error makes me think we somehow counted the only existing layer twice.

2024-11-01T15:41:55.639814Z ERROR secondary_download{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}:panic{thread=background op worker location=pageserver/src/tenant/secondary/downloader.rs:778:13}: assertion `left == right` failed
  left: 2605056
 right: 5210112

<backtrace snipped>

2024-11-01T15:41:55.873307Z ERROR secondary_download_scheduler:panic{thread=background op worker location=pageserver/src/tenant/secondary/scheduler.rs:307:36}: Panic in background task: JoinError::Panic(Id(1360288), ...)

<backtrace snipped>

2024-11-01T15:41:55.883651Z ERROR Task panicked, exiting process: Any { .. } task_name="secondary tenant downloads"

One observation was that the panic completely killed the pageserver process. This was unexpected to me, so panic handling on that code path should be checked as well.
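For context on why the whole process died: awaiting a spawned tokio task surfaces a panic as a JoinError, and whoever awaits the handle decides whether to log it, re-raise it, or exit. A minimal sketch (not the pageserver's actual task framework; `run_secondary_download` is a hypothetical stand-in for the download iteration):

```rust
// Sketch: a panic inside a spawned task is reported to the awaiting caller
// as a JoinError, so the caller controls whether the process survives.
async fn run_secondary_download() {
    // ... download work that may panic on a failed debug assertion ...
}

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(run_secondary_download());
    match handle.await {
        Ok(()) => println!("iteration finished"),
        Err(e) if e.is_panic() => {
            // Logging here keeps the process alive; re-raising the panic
            // (e.g. via std::panic::resume_unwind) is what ends up killing it.
            eprintln!("background task panicked: {e}");
        }
        Err(e) => eprintln!("background task cancelled: {e}"),
    }
}
```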

Todo:

jcsp commented 7 hours ago

This is an artifact of secondary tenant objects (and their RAII contribution to the metric) living too long. The sequence in the test is:

2024-11-01T15:37:54.313106Z  INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=3f096656-7248-4a2f-9a6c-4957b226859e}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Secondary, warm=true
2024-11-01T15:40:46.655306Z  INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=6c7b32e7-5542-470a-80e4-cac013fdb37f}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Multi, gen=00000002
2024-11-01T15:40:49.024330Z  INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=4d131964-e49c-4adb-90de-02d3b53b1dfd}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Single, gen=00000002
2024-11-01T15:40:51.505073Z  INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=8c6fe1ed-fb57-4629-828d-16507b68b3a6}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Stale, gen=00000002
2024-11-01T15:40:51.652909Z  INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=302a00ee-7af6-44ca-87f4-1532b90d5cdb}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Secondary, warm=true
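For illustration, a hypothetical sketch of the RAII pattern described above (not the actual metrics code): a guard adds to a shared resident-size gauge when created and subtracts on Drop. If the old secondary tenant object (and its guard) outlives the re-configuration back to Secondary, two guards cover the same layer and the gauge reads exactly double, matching the 2605056 vs 5210112 values in the assertion:

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// RAII contribution to a resident-size gauge: add on creation, subtract on Drop.
struct ResidentSizeGuard {
    gauge: Arc<AtomicU64>,
    bytes: u64,
}

impl ResidentSizeGuard {
    fn new(gauge: Arc<AtomicU64>, bytes: u64) -> Self {
        gauge.fetch_add(bytes, Ordering::Relaxed);
        Self { gauge, bytes }
    }
}

impl Drop for ResidentSizeGuard {
    fn drop(&mut self) {
        self.gauge.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}

fn main() {
    let gauge = Arc::new(AtomicU64::new(0));
    // Old secondary location, not yet dropped because something still holds it.
    let _old = ResidentSizeGuard::new(gauge.clone(), 2_605_056);
    // New secondary location accounts for the same layer again.
    let _new = ResidentSizeGuard::new(gauge.clone(), 2_605_056);
    // The gauge now reports 5_210_112 even though only one layer is resident.
    assert_eq!(gauge.load(Ordering::Relaxed), 5_210_112);
}
```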

Right before the assertion failure, we see a log message about the runtime of a secondary download; the elapsed time means it must have started before the last time we configured the tenant to Secondary (i.e. the download task from the original secondary-mode tenant was still running):

2024-11-01T15:40:54.550549Z  INFO secondary_download{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: task iteration took longer than the configured period elapsed=39.025312404s period=10s task=SecondaryDownload

This is peculiar, because SecondaryTenant::shutdown does block on a gate held by TenantDownloader::download.
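As a rough model of that gate relationship (not Neon's actual Gate type): the downloader holds a read guard for the duration of download(), and shutdown() waits for the write half, so shutdown() cannot complete while a download iteration is still in flight.

```rust
use std::sync::Arc;
use tokio::sync::{OwnedRwLockReadGuard, RwLock};

// Toy gate: readers are in-flight downloads, the writer is shutdown.
#[derive(Clone)]
struct Gate(Arc<RwLock<()>>);

impl Gate {
    fn new() -> Self {
        Gate(Arc::new(RwLock::new(())))
    }

    async fn enter(&self) -> OwnedRwLockReadGuard<()> {
        self.0.clone().read_owned().await
    }

    async fn close(&self) {
        // Resolves only once every read guard (download) has been dropped.
        let _ = self.0.write().await;
    }
}

#[tokio::main]
async fn main() {
    let gate = Gate::new();

    // The download holds the gate for its whole duration.
    let guard = gate.enter().await;
    let download = tokio::spawn(async move {
        let _guard = guard;
        tokio::time::sleep(std::time::Duration::from_millis(50)).await;
    });

    gate.close().await; // shutdown() blocks here until the download finishes
    download.await.unwrap();
}
```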

jcsp commented 6 hours ago

The scheduler for secondary downloads is allowed to keep the SecondaryTenant's Arc alive after shutdown() is done, but the metric erase only happened in Drop. We can move it into shutdown() to get cleaner behavior.
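A hedged sketch of what that change could look like, with illustrative names rather than the actual pageserver types: the metric contribution is released explicitly in shutdown(), and Drop only acts as a fallback, so it no longer matters how long the scheduler keeps the object alive afterwards.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for the secondary tenant's metric contribution.
struct SecondaryTenantMetrics {
    gauge: Arc<AtomicU64>,
    resident_bytes: u64,
    released: bool,
}

impl SecondaryTenantMetrics {
    fn new(gauge: Arc<AtomicU64>, resident_bytes: u64) -> Self {
        gauge.fetch_add(resident_bytes, Ordering::Relaxed);
        Self { gauge, resident_bytes, released: false }
    }

    // Erase the contribution eagerly at shutdown time.
    fn shutdown(&mut self) {
        if !self.released {
            self.gauge.fetch_sub(self.resident_bytes, Ordering::Relaxed);
            self.released = true;
        }
    }
}

impl Drop for SecondaryTenantMetrics {
    fn drop(&mut self) {
        // Fallback only: if shutdown() already ran, there is nothing left to erase.
        self.shutdown();
    }
}

fn main() {
    let gauge = Arc::new(AtomicU64::new(0));
    let mut metrics = SecondaryTenantMetrics::new(gauge.clone(), 2_605_056);
    // Explicit shutdown erases the contribution right away...
    metrics.shutdown();
    assert_eq!(gauge.load(Ordering::Relaxed), 0);
    // ...so a late Drop (e.g. the scheduler still holding the object) is harmless.
    drop(metrics);
    assert_eq!(gauge.load(Ordering::Relaxed), 0);
}
```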