Open jcsp opened 2 weeks ago
This was an issue with the storage controller handling a Detached tenant, which we don't currently do in the field.
Test does the following in a loop:
When we detach, we don't update the compute hook state: it still points to the detached pageservers. When the first shard finishes its location config, it goes ahead and reconfigures the compute with a mixture of new state (for the shard that was just reconciled) and old state. The compute tries to prefetch something as part of reconfiguration, and we get a deadlock. In prod, the cplane database acts as a buffer that masks this eventual consistency.
We can fix this by updating hook state on detach.
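To make the failure mode concrete, here is a toy model of it. This is only a sketch, not the storage controller's actual code: the `ComputeHook` class, its methods, the `update_hook_state` flag, and the pageserver names are all hypothetical, and it assumes the hook holds back a notification until it knows a location for every shard.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ComputeHook:
    """Hypothetical model: remembers the last known pageserver per shard."""

    shard_count: int
    # shard index -> pageserver last reported for that shard
    state: Dict[int, str] = field(default_factory=dict)

    def shard_reconciled(self, shard: int, pageserver: str) -> Optional[Dict[int, str]]:
        """Record a shard's location; return the config to send to the compute, if any."""
        self.state[shard] = pageserver
        if len(self.state) < self.shard_count:
            # Assumption: hold the notification until every shard has a location.
            return None
        return dict(self.state)

    def detach(self, update_hook_state: bool) -> None:
        """Detach the tenant; with the fix, also forget the old locations."""
        if update_hook_state:
            self.state.clear()


def first_notification_after_detach(update_hook_state: bool) -> Optional[Dict[int, str]]:
    """Assumed sequence: attach two shards, detach, then reconcile shard 0 only."""
    hook = ComputeHook(shard_count=2)
    hook.shard_reconciled(0, "ps-old-a")
    hook.shard_reconciled(1, "ps-old-b")
    hook.detach(update_hook_state)
    # Only shard 0 has finished its new location config at this point.
    return hook.shard_reconciled(0, "ps-new-a")
```

Without the fix (`update_hook_state=False`), the first notification mixes the new location for shard 0 with the stale one for shard 1. With it, nothing is sent until both shards have been reconciled.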
An interesting question that arises from this: should we notify cplane about detaches? It complicates the interaction between services, but ensures that a compute can't send requests with a stale pageserver list.
Since 4th Sep
https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8681/10718968862/index.html#testresult/9177f6b50b1cbc31/retries