neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

A pageserver client that's not receiving data blocks timeline deletion #5341

Closed by hlinnaka 11 months ago

hlinnaka commented 1 year ago

This has happened in production a couple of times now:

The situation eventually resolves when the TCP connection times out, but that can take a very long time. Worse, if the client is still present but simply isn't reading the data the pageserver sends it, the deletion will be stuck forever.
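For context, a minimal sketch (not the actual pageserver code) of why a non-reading peer blocks the serving task indefinitely: once TCP backpressure fills the kernel's send buffer, the write futures never complete, so the task holding the connection, and anything waiting on it such as timeline deletion, cannot make progress.

```rust
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Illustrative only: `send_page` stands in for the pageserver's write path.
// If the peer stops reading, the send buffer fills and both awaits below
// can pend forever; nothing bounds the wait, so the serving task only
// exits once the TCP connection itself times out.
async fn send_page(stream: &mut TcpStream, page: &[u8]) -> std::io::Result<()> {
    stream.write_all(page).await?;
    stream.flush().await
}
```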

arpad-m commented 1 year ago

The storage team talked yesterday at the offsite about putting cancellation tokens everywhere, with a hierarchy for process, tenant, timeline, and request. If timeline deletion invoked all timeline-specific cancellation tokens, we could resolve this that way.
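A minimal sketch of that hierarchy using `tokio_util::sync::CancellationToken` (the struct and method names here are illustrative, not the actual pageserver types): child tokens complete when their parent is cancelled, so cancelling a timeline's token transitively cancels every in-flight request under it.

```rust
use tokio_util::sync::CancellationToken;

// Illustrative hierarchy only; the real pageserver types differ.
struct Process  { cancel: CancellationToken }
struct Tenant   { cancel: CancellationToken }
struct Timeline { cancel: CancellationToken }

impl Process {
    fn new() -> Self {
        Self { cancel: CancellationToken::new() }
    }
    fn tenant(&self) -> Tenant {
        // child_token(): cancelled when the parent is cancelled,
        // but can also be cancelled on its own.
        Tenant { cancel: self.cancel.child_token() }
    }
}

impl Tenant {
    fn timeline(&self) -> Timeline {
        Timeline { cancel: self.cancel.child_token() }
    }
}

impl Timeline {
    // Per-request token handed to each task serving a compute connection.
    fn request_token(&self) -> CancellationToken {
        self.cancel.child_token()
    }
    // Deletion cancels the timeline token, which transitively fires
    // every outstanding request token.
    fn delete(&self) {
        self.cancel.cancel();
    }
}
```

A task serving a compute connection would then `tokio::select!` on its request token alongside its socket I/O, so a cancelled deletion no longer has to wait for the peer.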

koivunej commented 11 months ago

Reopening because this happened again in production, see https://neondb.slack.com/archives/C06345636RG/p1698416136263129 (private Slack; a 20 min wait during ignore, waiting for the serving compute connection task to shut down).

Identified https://github.com/neondatabase/neon/blob/c13e932c3bd8096669619326ab8decdfd2ffca20/libs/postgres_backend/src/lib.rs#L402 as the possible culprit, with this as a possible chain of events:

  1. a cancellation fires while flushing, and QueryError::Shutdown is returned
  2. the error bubbles up to PostgresBackend::run
  3. whatever write was cancelled (producing QueryError::Shutdown) is now awaited again on that line

As a remedy, we should simply not attempt a clean shutdown in the case of QueryError::Shutdown. Alternatively, we could wrap that ~flush in pq_proto (linking soon)~ shutdown in a tokio::select! as well. In case of cancellation it would be ready immediately, which is fine and what we want.
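A sketch of the tokio::select! alternative (illustrative names, not the actual postgres_backend code): an already-cancelled token's `cancelled()` future completes immediately, so the shutdown await can no longer hang on a peer that isn't reading.

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio_util::sync::CancellationToken;

// Sketch of the proposed fix: don't await a clean shutdown unconditionally.
// If the token is (or becomes) cancelled, abandon the socket instead of
// blocking on a peer that isn't reading. `stream` and `cancel` are
// illustrative names.
async fn shutdown_or_abandon<S: AsyncWrite + Unpin>(
    stream: &mut S,
    cancel: &CancellationToken,
) -> std::io::Result<()> {
    tokio::select! {
        // Ready immediately when cancellation has already happened, so a
        // shutdown that raced with cancellation no longer hangs here.
        _ = cancel.cancelled() => Ok(()),
        res = stream.shutdown() => res,
    }
}
```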

problame commented 11 months ago

removing the /triaged label so this doesn't slip