timescale / timescaledb-docker-ha

Create Docker images containing TimescaleDB, Patroni to be used by developers and Kubernetes.
Apache License 2.0

SIGINT not honored by Patroni anymore #499

Open talpa-robin opened 2 hours ago

talpa-robin commented 2 hours ago

Hey Team :)

We're using the image tag timescale/timescaledb-ha:pg13.16-ts2.15.3 with Patroni. Since the last update to that tag (which came with an upgrade to Patroni 4.0.2 and the STOPSIGNAL change from SIGTERM to SIGINT, see https://github.com/timescale/timescaledb-docker-ha/issues/492), the "delete/stop" commands from Kubernetes no longer lead to a graceful shutdown. Nothing happens inside the pod or its processes, and the pod is then forcefully killed once terminationGracePeriodSeconds expires.

Here are the relevant processes running inside the pod for a replica of a three-node HA setup:

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
postgres       1  0.0  0.2  50212 34904 ?        Ss   10:00   0:00 /usr/bin/python3 /usr/bin/patroni /etc/timescaledb/patroni.yaml
postgres      15  0.1  0.2 583492 37044 ?        Sl   10:00   0:18 /usr/bin/python3 /usr/bin/patroni /etc/timescaledb/patroni.yaml
postgres     384  0.0  0.8 3786888 129736 ?      S    10:17   0:00 postgres -D /var/lib/postgresql/data --config-file=/var/lib/postgresql/data/postgresql.conf --listen_addresses=0.0.0.0 --po
postgres     386  0.0  3.7 3787292 600200 ?      Ss   10:17   0:09 postgres: xxx-timescaledb-xxx: startup recovering 00000096000032CE00000034
postgres     393  0.0  3.6 3787032 580960 ?      Ss   10:17   0:06 postgres: xxx-timescaledb-xxx: checkpointer 
postgres     394  0.0  0.2 3786888 37240 ?       Ss   10:17   0:00 postgres: xxx-timescaledb-xxx: background writer 
postgres     395  0.0  0.0  73260  9108 ?        Ss   10:17   0:03 postgres: xxx-timescaledb-xxx: stats collector 
postgres     402  0.0  0.1 3790584 28364 ?       Ss   10:17   0:00 postgres: xxx-timescaledb-xxx: postgres postgres [local] idle
postgres     404  0.0  0.2 3791024 32060 ?       Ss   10:17   0:02 postgres: xxx-timescaledb-xxx: postgres postgres [local] idle
postgres    2337  0.0  0.1 3790236 28708 ?       Ss   12:26   0:00 postgres: xxx-timescaledb-xxx: postgres postgres [local] idle
postgres    2349  0.1  0.0 3787784 15152 ?       Ss   12:26   0:11 postgres: xxx-timescaledb-xxx: walreceiver streaming 32CE/34C2B6F8

When we send a SIGINT to PID 1 or PID 15, nothing happens (we also simulated this with kill -s SIGINT <PID>). The auditd logs on the host machine show that PID 1 receives the SIGINT but PID 15 doesn't. When we send a SIGTERM, everything works as expected.
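The check described above (send a signal, then see whether the process exits) can be scripted roughly like this. This is only a sketch: STUBBORN_CHILD and exit_code_after are hypothetical names, and the stand-in child deliberately ignores SIGINT purely to mimic the symptom. It is not a diagnosis of what Patroni or this image does.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for the process under test: it ignores SIGINT
# and keeps the default action for SIGTERM (assumption, for illustration only).
STUBBORN_CHILD = """
import signal, time
signal.signal(signal.SIGINT, signal.SIG_IGN)
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def exit_code_after(sig, timeout=2.0):
    """Send `sig` to a fresh stand-in child.

    Returns the child's exit code, or None if it is still running
    after `timeout` seconds.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE
    )
    try:
        child.stdout.readline()  # wait until the child has set up its signals
        child.send_signal(sig)
        try:
            return child.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            return None
    finally:
        # Clean up the child if the signal did not stop it.
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()
```

On a POSIX system, a None result for SIGINT combined with a negative exit code for SIGTERM corresponds to the behaviour we see in the pod.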

We first thought it might be a problem with Patroni, but the guys over in the Patroni Slack couldn't reproduce it, and our internal tests with this setup https://github.com/patroni/patroni/tree/master/docker also confirm that Patroni works as intended there.

Hope you can help. If you need more information don't hesitate to ask :)

Have a great weekend

graveland commented 2 hours ago

That is odd for sure... Patroni's __main__.py has:

    signal.signal(signal.SIGINT, passtochild)

and that handler should be installed when Patroni runs as PID 1.
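
For context, the forwarding pattern that line belongs to can be sketched roughly as follows. This is not Patroni's actual code: CHILD_CODE and run_forwarding_parent are hypothetical names, and a plain subprocess stands in for the Patroni worker process. The parent installs a passtochild handler, receives the signal itself, and relays it to the child.

```python
import os
import signal
import subprocess
import sys

# Hypothetical worker: exits cleanly (code 0) once it receives SIGINT.
CHILD_CODE = """
import signal, sys, time
signal.signal(signal.SIGINT, lambda signo, frame: sys.exit(0))
print("ready", flush=True)
while True:
    time.sleep(0.1)
"""


def run_forwarding_parent(sig=signal.SIGINT):
    """Start a child, forward `sig` from this process to it, return its exit code.

    Must be called from the main thread, because signal.signal() requires it.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE
    )

    def passtochild(signo, frame):
        # Relay whatever signal the parent received to the child.
        os.kill(child.pid, signo)

    previous = signal.signal(sig, passtochild)
    try:
        child.stdout.readline()    # wait until the child has installed its handler
        os.kill(os.getpid(), sig)  # simulate the signal arriving at the parent only
        return child.wait(timeout=10)
    finally:
        if previous is not None:
            signal.signal(sig, previous)  # restore the original handler
        if child.poll() is None:
            child.kill()
            child.wait()
        child.stdout.close()
```

If forwarding works, the child exits with code 0 even though the signal was only ever sent to the parent. If the handler were not installed or not invoked, the wait would time out instead, which matches the symptom described in this issue.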