neondatabase / neon

Neon: Serverless Postgres. We separated storage and compute to offer autoscaling, code-like database branching, and scale to zero.
https://neon.tech
Apache License 2.0

`task_mgr::spawn`'s `shutdown_process_on_error` doesn't work reliably [P:3] [S:0] #3402

Closed LizardWizzard closed 7 months ago

LizardWizzard commented 1 year ago

Steps to reproduce

This was discovered via https://github.com/neondatabase/neon/issues/3387

Apparently the synthetic size calculation task hit an error that triggered a shutdown. Synthetic size calculation shouldn't lead to a pageserver shutdown, and this is fixed in #3392. But shutdown on error should still work even if it's triggered erroneously. That is what this issue is about.

Expected result

The pageserver restarts.

Actual result

The pageserver was stuck in a semi-alive state in which some of the tasks were stopped and others continued running. The Postgres protocol listener was shut down, so this resulted in connection refused errors during basebackups.

Environment

prod.

Logs, links

shanyp commented 1 year ago

Consider having a timeout for this one.
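One way to read the timeout suggestion, as a rough sketch using only the Rust standard library: run the graceful shutdown on a separate thread, wait for it with a deadline, and exit the process regardless once the deadline passes. The names `shutdown_completed_within` and `shutdown_then_exit` are hypothetical and do not come from the pageserver code; the exit code of 1 is an assumption based on the shutdown having been triggered by an error.

```rust
use std::process;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `graceful_shutdown` on a helper thread and reports whether it
/// finished before `timeout` elapsed. (Hypothetical helper, not a pageserver API.)
pub fn shutdown_completed_within<F>(graceful_shutdown: F, timeout: Duration) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel::<()>();
    thread::spawn(move || {
        graceful_shutdown();
        // The receiver may already be gone if the deadline has passed.
        let _ = done_tx.send(());
    });
    // Err covers both a timeout and a panicked helper thread (sender dropped
    // without sending); either way the shutdown did not complete cleanly.
    done_rx.recv_timeout(timeout).is_ok()
}

/// Attempts a graceful shutdown, then exits the process unconditionally so it
/// can never linger with only some of its tasks still running.
pub fn shutdown_then_exit<F>(graceful_shutdown: F, timeout: Duration) -> !
where
    F: FnOnce() + Send + 'static,
{
    if !shutdown_completed_within(graceful_shutdown, timeout) {
        eprintln!(
            "graceful shutdown did not finish within {:?}; exiting anyway",
            timeout
        );
    }
    // Assumption: non-zero exit, since an error triggered the shutdown.
    process::exit(1)
}
```

With a deadline in place, a shutdown routine that hangs can at worst delay the exit by `timeout`, after which a supervisor can restart the process.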

jcsp commented 7 months ago

Fixed in https://github.com/neondatabase/neon/pull/6105 -- we now exit(1) in the shutdown_process=true case.
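The behaviour described here can be illustrated with a minimal sketch. This is not the code from the PR: `TaskOutcome`, `classify`, and `handle_task_result` are hypothetical names, and the error type is simplified to `String`. The idea is only that a failed task spawned with the shutdown flag set leads straight to `exit(1)` instead of relying on the remaining tasks to wind down.

```rust
use std::process;

/// What a task wrapper should do once the task body has returned.
/// (Hypothetical type for illustration.)
#[derive(Debug, PartialEq, Eq)]
pub enum TaskOutcome {
    /// The task succeeded; nothing more to do.
    Finished,
    /// The task failed, but it was not marked as fatal for the process.
    LogOnly,
    /// The task failed and was spawned with shutdown_process_on_error = true.
    ExitProcess,
}

/// Pure decision step, kept separate from the side effects so it can be tested.
pub fn classify(result: &Result<(), String>, shutdown_process_on_error: bool) -> TaskOutcome {
    match (result, shutdown_process_on_error) {
        (Ok(()), _) => TaskOutcome::Finished,
        (Err(_), false) => TaskOutcome::LogOnly,
        (Err(_), true) => TaskOutcome::ExitProcess,
    }
}

/// Applies the decision: logs the failure and, for fatal tasks, exits at once.
pub fn handle_task_result(name: &str, result: Result<(), String>, shutdown_process_on_error: bool) {
    match classify(&result, shutdown_process_on_error) {
        TaskOutcome::Finished => {}
        TaskOutcome::LogOnly => eprintln!("task '{}' failed: {:?}", name, result),
        TaskOutcome::ExitProcess => {
            eprintln!("task '{}' failed, exiting process: {:?}", name, result);
            // Exiting immediately avoids the semi-alive state from the report.
            process::exit(1);
        }
    }
}
```

Exiting directly trades a clean teardown for a guarantee that the process never keeps running with only part of its tasks alive.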