A scale test failed when one shard went secondary->attached->secondary within a short period of time: the shard's metrics failed a validation assertion that checks that the size metric matches the sum of layer sizes in the SecondaryDetail struct.
This appears to be caused by two SecondaryTenants being alive at the same time: the first one had been shut down, but its contributions were still counted in the metrics.
Refactor the code for validating metrics and call it during shutdown as well as during downloads.
Move code for dropping per-tenant secondary metrics from drop() into shutdown(), so that once shutdown() completes it is definitely safe to instantiate another SecondaryTenant for the same tenant.
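The ordering described above can be illustrated with a minimal sketch. It is not the pageserver code: `SizeGauges`, `on_layer_downloaded`, `validate_metrics`, and `reattach_cycle` are hypothetical names, a plain `HashMap` stands in for the real per-tenant metrics, and a `Vec<u64>` stands in for SecondaryDetail. It only shows why clearing the per-tenant metric in shutdown(), rather than waiting for drop(), lets a second instance start from a clean value while the first is still alive.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for per-tenant size gauges (not the real metrics API).
type SizeGauges = Arc<Mutex<HashMap<String, u64>>>;

struct SecondaryTenant {
    tenant_id: String,
    /// Stand-in for the layer sizes tracked in SecondaryDetail.
    layer_sizes: Vec<u64>,
    gauges: SizeGauges,
    shut_down: bool,
}

impl SecondaryTenant {
    fn new(tenant_id: &str, gauges: SizeGauges) -> Self {
        // A stale entry left behind by an earlier instance would survive here,
        // which is the failure mode the change is meant to rule out.
        gauges
            .lock()
            .unwrap()
            .entry(tenant_id.to_string())
            .or_insert(0);
        SecondaryTenant {
            tenant_id: tenant_id.to_string(),
            layer_sizes: Vec::new(),
            gauges,
            shut_down: false,
        }
    }

    fn on_layer_downloaded(&mut self, size: u64) {
        self.layer_sizes.push(size);
        *self
            .gauges
            .lock()
            .unwrap()
            .entry(self.tenant_id.clone())
            .or_insert(0) += size;
        // Validate during downloads.
        self.validate_metrics();
    }

    /// Assert that the size gauge equals the sum of tracked layer sizes.
    fn validate_metrics(&self) {
        let expected: u64 = self.layer_sizes.iter().sum();
        let actual = self
            .gauges
            .lock()
            .unwrap()
            .get(&self.tenant_id)
            .copied()
            .unwrap_or(0);
        assert_eq!(actual, expected, "size gauge diverged from layer sizes");
    }

    /// Validate one last time, then remove this tenant's gauge so that a new
    /// instance for the same tenant can be created safely afterwards.
    fn shutdown(&mut self) {
        if self.shut_down {
            return;
        }
        self.validate_metrics();
        self.gauges.lock().unwrap().remove(&self.tenant_id);
        self.shut_down = true;
    }
}

/// Simulates secondary -> attached -> secondary for one tenant and returns
/// the final gauge value.
fn reattach_cycle() -> u64 {
    let gauges: SizeGauges = Arc::new(Mutex::new(HashMap::new()));

    let mut first = SecondaryTenant::new("tenant-a", gauges.clone());
    first.on_layer_downloaded(10);
    first.on_layer_downloaded(20);
    first.shutdown();

    // `first` has been shut down but not dropped; both instances are alive.
    let mut second = SecondaryTenant::new("tenant-a", gauges.clone());
    second.on_layer_downloaded(5);

    let value = gauges.lock().unwrap().get("tenant-a").copied().unwrap_or(0);
    value
}
```

In this sketch, removing the `first.shutdown()` call makes the assertion in `second.on_layer_downloaded(5)` fail, because the gauge still carries the first instance's 30 bytes.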
Closes: https://github.com/neondatabase/neon/issues/9628