Closed hlinnaka closed 11 months ago
The storage team talked yesterday at the offsite about putting cancellation tokens everywhere, with a hierarchy of process, tenant, timeline, and request. If timeline deletion cancelled all timeline-specific cancellation tokens, we could resolve this that way.
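To make the hierarchy idea concrete, here is a minimal std-only sketch. The `CancelToken` type and its methods are hypothetical names for illustration, not the pageserver's actual implementation; in async code, `tokio_util::sync::CancellationToken` with `child_token()` would likely be the more natural building block.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical hierarchical cancellation token (illustration only).
/// A token counts as cancelled if it, or any of its ancestors, was cancelled.
#[derive(Clone)]
pub struct CancelToken {
    inner: Arc<Inner>,
}

struct Inner {
    cancelled: AtomicBool,
    parent: Option<CancelToken>,
}

impl CancelToken {
    /// Create a root token, e.g. one per process.
    pub fn new_root() -> Self {
        CancelToken {
            inner: Arc::new(Inner {
                cancelled: AtomicBool::new(false),
                parent: None,
            }),
        }
    }

    /// Create a child token, e.g. tenant -> timeline -> request.
    pub fn child(&self) -> Self {
        CancelToken {
            inner: Arc::new(Inner {
                cancelled: AtomicBool::new(false),
                parent: Some(self.clone()),
            }),
        }
    }

    /// Cancel this token; all descendants observe it via `is_cancelled`.
    pub fn cancel(&self) {
        self.inner.cancelled.store(true, Ordering::SeqCst);
    }

    /// Walk up the parent chain looking for a cancelled token.
    pub fn is_cancelled(&self) -> bool {
        let mut cur = Some(self);
        while let Some(tok) = cur {
            if tok.inner.cancelled.load(Ordering::SeqCst) {
                return true;
            }
            cur = tok.inner.parent.as_ref();
        }
        false
    }
}
```

With tokens built as process -> tenant -> timeline -> request, cancelling a timeline token on deletion would be seen by every request token derived from it, while sibling timelines and the tenant itself stay unaffected.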
Reopening because this happened again in production; see https://neondb.slack.com/archives/C06345636RG/p1698416136263129 (private Slack: a 20-minute wait in ignore, in "waiting for serving compute connection task to shut down").
Identified https://github.com/neondatabase/neon/blob/c13e932c3bd8096669619326ab8decdfd2ffca20/libs/postgres_backend/src/lib.rs#L402 as a possible culprit, and this as a possible chain of events:
As a remedy, we should simply not attempt a clean shutdown in case of QueryError::Shutdown. Alternatively, we could wrap that ~flush in pq_proto (linking soon)~ shutdown in a tokio::select! as well. In case of cancellation, it would be ready immediately, which is fine and what we want.
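A rough sketch of the first remedy, under stated assumptions: the `QueryError` enum below is a simplified stand-in for the real type (only the `Shutdown` variant is taken from this thread), and `CloseAction` and `close_action` are hypothetical names, not code from postgres_backend.

```rust
/// Simplified, hypothetical stand-in for the error type named in the thread.
#[derive(Debug)]
pub enum QueryError {
    /// The server side is shutting down (e.g. the timeline is being deleted).
    Shutdown,
    /// Any other failure; the payload is illustrative only.
    Other(String),
}

/// Hypothetical decision about how to close the client connection.
#[derive(Debug, PartialEq, Eq)]
pub enum CloseAction {
    /// Flush pending data and shut the socket down cleanly.
    CleanShutdown,
    /// Drop the connection without waiting on the peer.
    DropImmediately,
}

/// Decide how to close the connection once query handling has finished.
/// On `QueryError::Shutdown` we skip the clean shutdown, because waiting
/// for a client that is not reading could block the shutdown indefinitely.
pub fn close_action(result: &Result<(), QueryError>) -> CloseAction {
    match result {
        Err(QueryError::Shutdown) => CloseAction::DropImmediately,
        Ok(()) | Err(QueryError::Other(_)) => CloseAction::CleanShutdown,
    }
}
```

The alternative from the same comment would keep the clean shutdown but race it against a cancellation future inside tokio::select!, so that cancellation wins immediately if the peer is not draining its socket.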
Removing the /triaged label so this doesn't slip.
This has happened in production a couple of times now:
The situation resolves eventually, when the TCP connection times out, but that can take a very long time. Worse, if the client is still present and simply isn't reading the data the pageserver sends it, it will be stuck forever.