neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

Failures in test_scrubber_physical_gc #8928

Open jcsp opened 2 weeks ago

jcsp commented 2 weeks ago

Since 4th Sep

https://neon-github-public-dev.s3.amazonaws.com/reports/pr-8681/10718968862/index.html#testresult/9177f6b50b1cbc31/retries

jcsp commented 3 days ago

This was an issue with how the storage controller handles a Detached tenant, which is something we don't currently do in the field.

VladLazar commented 3 days ago

Test does the following in a loop:

When we detach, we don't update the compute hook state, so it still points to the detached pageservers. When the first shard finishes its location config, it goes ahead and reconfigures the compute with a mixture of new state (for the shard that was just reconciled) and old state (for the shards that have not been reconciled yet). The compute tries to prefetch something as part of reconfiguration and we get a deadlock. In prod, the cplane database acts as a buffer that masks this eventual consistency.

We can fix this by updating the hook state on detach.
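
To make the stale-state problem concrete, here is a minimal toy model in Python. It is not the storage controller's code: `ComputeHookModel`, its methods, and the pageserver ids are all hypothetical, and it assumes the hook holds back a notification until it knows a location for every shard of the tenant. Under those assumptions, it shows why forgetting the tenant's entry on detach prevents a mixed old/new notification.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ComputeHookModel:
    """Toy stand-in for the compute hook state (hypothetical, for illustration only)."""

    # tenant_id -> {shard_number: pageserver_id}
    state: Dict[str, Dict[int, int]] = field(default_factory=dict)

    def on_shard_reconciled(
        self, tenant_id: str, shard_count: int, shard_number: int, pageserver_id: int
    ) -> Optional[Dict[int, int]]:
        """Record one shard's new location.

        Returns the shard -> pageserver map that would be sent to the compute,
        or None while some shards are still unknown (assumed behaviour).
        """
        shards = self.state.setdefault(tenant_id, {})
        shards[shard_number] = pageserver_id
        if len(shards) < shard_count:
            return None
        return dict(shards)

    def on_detach(self, tenant_id: str) -> None:
        """The proposed fix: forget the tenant's locations when it is detached."""
        self.state.pop(tenant_id, None)


hook = ComputeHookModel()

# Initial attach: shard 0 on pageserver 1, shard 1 on pageserver 2.
hook.on_shard_reconciled("tenant", 2, 0, 1)
hook.on_shard_reconciled("tenant", 2, 1, 2)

# Detach WITHOUT clearing hook state, then re-attach shard 0 on pageserver 3.
# The notification mixes the new location with the stale one for shard 1.
stale = hook.on_shard_reconciled("tenant", 2, 0, 3)

# Detach WITH clearing hook state: nothing is sent until both shards are known.
hook.on_detach("tenant")
first = hook.on_shard_reconciled("tenant", 2, 0, 3)
second = hook.on_shard_reconciled("tenant", 2, 1, 4)
```

In this model `stale` pairs shard 0's new pageserver with shard 1's detached one, which is the mixed state described above, while after `on_detach` the first reconcile produces no notification at all.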

An interesting question that arises from this is: "Should we notify cplane about detaches?" Doing so would complicate the interaction between services, but it would ensure that a compute can't send requests using a stale list of pageservers.