Open VladLazar opened 2 weeks ago
This is an artifact of secondary tenant objects (and their RAII contribution to the metric) living too long. The sequence in the test is:
2024-11-01T15:37:54.313106Z INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=3f096656-7248-4a2f-9a6c-4957b226859e}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Secondary, warm=true
2024-11-01T15:40:46.655306Z INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=6c7b32e7-5542-470a-80e4-cac013fdb37f}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Multi, gen=00000002
2024-11-01T15:40:49.024330Z INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=4d131964-e49c-4adb-90de-02d3b53b1dfd}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Single, gen=00000002
2024-11-01T15:40:51.505073Z INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=8c6fe1ed-fb57-4629-828d-16507b68b3a6}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Attached Stale, gen=00000002
2024-11-01T15:40:51.652909Z INFO request{method=PUT path=/v1/tenant/11bd770f785975106c00a57e1b60ae2c-0408/location_config request_id=302a00ee-7af6-44ca-87f4-1532b90d5cdb}:upsert_location{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: configuring tenant location to state Secondary, warm=true
Right before the assertion failure, we see a message about the runtime of a secondary download iteration. Its elapsed time (about 39s) means the iteration must have started before the last time we configured the tenant to secondary, i.e. the download task from the original secondary mode tenant was still running:
2024-11-01T15:40:54.550549Z INFO secondary_download{tenant_id=11bd770f785975106c00a57e1b60ae2c shard_id=0408}: task iteration took longer than the configured period elapsed=39.025312404s period=10s task=SecondaryDownload
This is peculiar, because SecondaryTenant::shutdown does block on a gate held by TenantDownloader::download.
The scheduler for secondary downloads is allowed to keep Arc
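To illustrate how an RAII metric contribution can outlive the configuration it belonged to, here is a minimal sketch. It is not the pageserver's code: `SecondaryState`, `reconfigure_demo`, and the plain atomic gauge are hypothetical stand-ins, and the only assumption is that some long-running task holds an extra `Arc` to the old object while a new one is created.

```rust
use std::sync::atomic::{AtomicI64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a secondary tenant object. It adds 1 to a
/// shared gauge when created and subtracts 1 when the last reference drops.
struct SecondaryState {
    gauge: Arc<AtomicI64>,
}

impl SecondaryState {
    fn new(gauge: Arc<AtomicI64>) -> Arc<Self> {
        gauge.fetch_add(1, Ordering::SeqCst);
        Arc::new(Self { gauge })
    }
}

impl Drop for SecondaryState {
    fn drop(&mut self) {
        // Runs only when the LAST Arc is dropped, not when the owner
        // that created the object lets go of its reference.
        self.gauge.fetch_sub(1, Ordering::SeqCst);
    }
}

/// Returns the gauge value at three points:
/// (initial secondary, after reconfiguring, after the old task exits).
pub fn reconfigure_demo() -> (i64, i64, i64) {
    let gauge = Arc::new(AtomicI64::new(0));

    // First secondary-mode object, plus a reference held by a
    // (hypothetical) long-running download task.
    let first = SecondaryState::new(Arc::clone(&gauge));
    let held_by_task = Arc::clone(&first);
    let initially = gauge.load(Ordering::SeqCst);

    // The location is reconfigured: the owner drops its reference and a
    // new secondary object is created while the task still holds the old one.
    drop(first);
    let second = SecondaryState::new(Arc::clone(&gauge));
    let after_reconfigure = gauge.load(Ordering::SeqCst);

    // Only when the old task finishes does the stale contribution go away.
    drop(held_by_task);
    let after_task_exit = gauge.load(Ordering::SeqCst);

    drop(second);
    (initially, after_reconfigure, after_task_exit)
}
```

In this sketch the gauge reads 2 between the reconfiguration and the old task's exit, even though only one object is logically live, which is the kind of double counting an assertion on the metric would catch.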
This testing assertion fired during test_storage_controller_many_tenants on PR https://github.com/neondatabase/neon/pull/8613 (Allure).
There's a Slack thread with a bit of context here. I spent some time looking at it, but didn't spot the issue. It happened shortly after a live migration, and the error makes me think we somehow counted the only existing layer twice.
One observation was that the panic completely killed the pageserver process. This was unexpected to me, so panic handling on that code path should be checked as well.
Todo: