neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0
14.78k stars 430 forks source link

Tenant detach error: tenant directory doesnt exist #4759

Closed LizardWizzard closed 9 months ago

LizardWizzard commented 1 year ago

Steps to reproduce

Not entirely sure about that. But in logs It can be seen that in both cases there is this sequence:

  1. create timeline call that times out and continued in the background while control plane is disconnected
  2. Tenant detach call arrives when timeline is still being created
  3. Tenant detach fails because tenant directory doesnt exist (what is confusing is that I think at this point tenant directory should exist so the source of the error needs investigation)
  4. Tenant detach succeds despite the fact that on the previous step the directory was missing
  5. Timeline creation handler finishes

Failures in timeline create with a message that uininit mark already exist are related to this issue. If timeline creation is retried when previous call timed out but execution is still continued in the background then this error is the expected outcome.

Note that this appeared around restart. Not sure if this is important, but may be related. At least it may be the cause of timeline create request timing out

Expected result

No errors.

I think there are two problems. First there is a race between timeline creation and tenant detach. This is wrong. Second, detach shouldnt fail when tenant directory is missing. This I expect to be fixed in https://github.com/neondatabase/neon/issues/4250. I have a WIP version of that, will link a PR once I open it.

Actual result

ERROR request{ path=/v1/tenant/X/detach request_id=X}: Error processing HTTP request: InternalServerError(Failed to run cleanup for tenant X
Caused by:
    0: local tenant directory "/storage/pageserver/data/tenants/X" removal
    1: No such file or directory (os error 2)

Environment

prod

Logs, links

koivunej commented 1 year ago

Sounds like a problem we can solve to some level with more utils::completion :) (I think this pattern is repeating.)

Basically we should only be able to create new timelines if let Some(guard) = self.timeline_creation.lock().unwrap_or_else(|e| e.into_inner()).clone() { given a Tenant::timeline_creation: Mutex<Option<completion::Completion>>.

On detach we should tenant.timeline_creation.lock().unwrap_or_else(|e| e.into_inner()).take(); and then await on the associated Tenant::timeline_creations_done: completion::Barrier to make sure everything is done. Unsure how the relation to should go, it'd be at least natural to forbid creating new timelines when Tenant::shutdown has started.