Restart BGW if crashed - Githubissues

tbe commented 7 years ago

We have seen some rare crashes of the partman_bgw. They occure under heavy database load.

As worker.bgw_restart_time is set to BGW_NEVER_RESTART the worker will never restart and we have to restart the whole database. As this is a highly productive environment, this is not always possible.

https://github.com/keithf4/pg_partman/blob/71d5851b5820847b4681dcf88c7c4830c56b9a0c/src/pg_partman_bgw.c#L174

keithf4 commented 7 years ago

I'll see if I can figure something out. If you happen to be able to get a core/trace of the crash, I might be able to find the cause.

tbe commented 7 years ago

we get an FATAL error, that the bgw was not able to create the dynamic background worker. At the same time, there were multiple background jobs running by other plugins (powa) and vacuums.

Maybe another solution would be, not to die with a FATAL if a dynamic worker could not created. Just retry this later?

keithf4 commented 7 years ago

Ahh I see. Yeah, I guess it's not playing nicely if you have other background processes that are using up all the workers. I didin't realize it would crash the main BGW by having the dynamic one throw a fatal error when that happens. Below is a patch to restart the main background worker after 10 minutes if it ever crashes as well as only throw an error with a hint if a dynamic bgw cannot be registered. If you have a chance to test this out, I'd appreciate it.

diff --git a/src/pg_partman_bgw.c b/src/pg_partman_bgw.c
index 0207db8..3d862a9 100644
--- a/src/pg_partman_bgw.c
+++ b/src/pg_partman_bgw.c
@@ -171,7 +171,7 @@ _PG_init(void)
     worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
         BGWORKER_BACKEND_DATABASE_CONNECTION;
     worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
-    worker.bgw_restart_time = BGW_NEVER_RESTART;
+    worker.bgw_restart_time = 600;
     #if (PG_VERSION_NUM < 100000)
     worker.bgw_main = pg_partman_bgw_main;
     #endif
@@ -286,7 +286,7 @@ void pg_partman_bgw_main(Datum main_arg) {

                 elog(DEBUG1, "Registering dynamic background worker...");
                 if (!RegisterDynamicBackgroundWorker(&worker, &handle)) {
-                    elog(FATAL, "Unable to register dynamic background worker for pg_partman");
+                    elog(ERROR, "Unable to register dynamic background worker for pg_partman. Consider increasing max_worker_processes if you see this frequently.");
                 }

                 elog(DEBUG1, "Waiting for BGW startup...");

keithf4 commented 7 years ago

I tested the above out by setting max_worker_processes = 1 so that there's no more left for the dynamic one to start and the crash recovery after 10 minutes works. I'll have this fix in the next release. Thanks for letting me know!

keithf4 commented 7 years ago

Version 3.0.2 has been released with a fix for this issue. Will require a restart of the cluster to load the new library, but that should then solve a crash from requiring any further restarts.

Thanks for reporting the issue!

pgpartman / pg_partman

Restart BGW if crashed #180