Closed LizardWizzard closed 9 months ago
Sounds like a problem we can solve to some level with more utils::completion
:) (I think this pattern is repeating.)
Basically we should only be able to create new timelines if let Some(guard) = self.timeline_creation.lock().unwrap_or_else(|e| e.into_inner()).clone() {
given a Tenant::timeline_creation: Mutex<Option<completion::Completion>>
.
On detach we should tenant.timeline_creation.lock().unwrap_or_else(|e| e.into_inner()).take();
and then await on the associated Tenant::timeline_creations_done: completion::Barrier
to make sure everything is done. Unsure how the relation to should go, it'd be at least natural to forbid creating new timelines when Tenant::shutdown
has started.
Steps to reproduce
Not entirely sure about that. But in logs It can be seen that in both cases there is this sequence:
Failures in timeline create with a message that uininit mark already exist are related to this issue. If timeline creation is retried when previous call timed out but execution is still continued in the background then this error is the expected outcome.
Note that this appeared around restart. Not sure if this is important, but may be related. At least it may be the cause of timeline create request timing out
Expected result
No errors.
I think there are two problems. First there is a race between timeline creation and tenant detach. This is wrong. Second, detach shouldnt fail when tenant directory is missing. This I expect to be fixed in https://github.com/neondatabase/neon/issues/4250. I have a WIP version of that, will link a PR once I open it.
Actual result
Environment
prod
Logs, links