neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

A pageserver client that's not receiving data blocks timeline deletion #5341

Closed by hlinnaka 11 months ago

hlinnaka commented 1 year ago

This has happened in production a couple of times now:

The situation eventually resolves when the TCP connection times out, but that can take a very long time. Worse, if the client is still present but simply isn't reading the data the pageserver sends it, the deletion will be stuck forever.
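For context, a minimal sketch (not the actual pageserver code) of why a non-reading peer blocks the serving task indefinitely: once TCP backpressure fills the kernel's send buffer, the write futures never complete, so the task holding the connection, and anything waiting on it such as timeline deletion, cannot make progress.

```rust
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Illustrative only: `send_page` stands in for the pageserver's write path.
// If the peer stops reading, the send buffer fills and both awaits below
// can pend forever; nothing bounds the wait, so the serving task only
// exits once the TCP connection itself times out.
async fn send_page(stream: &mut TcpStream, page: &[u8]) -> std::io::Result<()> {
    stream.write_all(page).await?;
    stream.flush().await
}
```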

arpad-m commented 1 year ago

The storage team talked yesterday at the offsite about putting cancellation tokens everywhere, with a hierarchy for process, tenant, timeline, and request. If timeline deletion invoked all timeline-specific cancellation tokens, we could resolve this that way.
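A minimal sketch of that hierarchy using `tokio_util::sync::CancellationToken` (the struct and method names here are illustrative, not the actual pageserver types): child tokens complete when their parent is cancelled, so cancelling a timeline's token transitively cancels every in-flight request under it.

```rust
use tokio_util::sync::CancellationToken;

// Illustrative hierarchy only; the real pageserver types differ.
struct Process  { cancel: CancellationToken }
struct Tenant   { cancel: CancellationToken }
struct Timeline { cancel: CancellationToken }

impl Process {
    fn new() -> Self {
        Self { cancel: CancellationToken::new() }
    }
    fn tenant(&self) -> Tenant {
        // child_token(): cancelled when the parent is cancelled,
        // but can also be cancelled on its own.
        Tenant { cancel: self.cancel.child_token() }
    }
}

impl Tenant {
    fn timeline(&self) -> Timeline {
        Timeline { cancel: self.cancel.child_token() }
    }
}

impl Timeline {
    // Per-request token handed to each task serving a compute connection.
    fn request_token(&self) -> CancellationToken {
        self.cancel.child_token()
    }
    // Deletion cancels the timeline token, which transitively fires
    // every outstanding request token.
    fn delete(&self) {
        self.cancel.cancel();
    }
}
```

A task serving a compute connection would then `tokio::select!` on its request token alongside its socket I/O, so a cancelled deletion no longer has to wait for the peer.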

koivunej commented 11 months ago

Reopening because this happened again in production, see https://neondb.slack.com/archives/C06345636RG/p1698416136263129 (private Slack; a 20 min wait during ignore, waiting for the serving compute connection task to shut down).

Identified https://github.com/neondatabase/neon/blob/c13e932c3bd8096669619326ab8decdfd2ffca20/libs/postgres_backend/src/lib.rs#L402 as the possible culprit, with this as a possible chain of events:

  1. a cancellation fires while flushing, and QueryError::Shutdown is returned
  2. the error bubbles up to PostgresBackend::run
  3. whatever write was cancelled (producing QueryError::Shutdown) is now awaited again on that line

As a remedy, we should simply not attempt a clean shutdown in the case of QueryError::Shutdown. Alternatively, we could wrap that ~flush in pq_proto (linking soon)~ shutdown in a tokio::select! as well. In case of cancellation it would be ready immediately, which is fine and what we want.
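A sketch of the tokio::select! alternative (illustrative names, not the actual postgres_backend code): an already-cancelled token's `cancelled()` future completes immediately, so the shutdown await can no longer hang on a peer that isn't reading.

```rust
use tokio::io::{AsyncWrite, AsyncWriteExt};
use tokio_util::sync::CancellationToken;

// Sketch of the proposed fix: don't await a clean shutdown unconditionally.
// If the token is (or becomes) cancelled, abandon the socket instead of
// blocking on a peer that isn't reading. `stream` and `cancel` are
// illustrative names.
async fn shutdown_or_abandon<S: AsyncWrite + Unpin>(
    stream: &mut S,
    cancel: &CancellationToken,
) -> std::io::Result<()> {
    tokio::select! {
        // Ready immediately when cancellation has already happened, so a
        // shutdown that raced with cancellation no longer hangs here.
        _ = cancel.cancelled() => Ok(()),
        res = stream.shutdown() => res,
    }
}
```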

problame commented 11 months ago

removing the /triaged label so this doesn't slip