pgpool / pgpool2_on_k8s


what happens when temporarily losing replicas #30

Open aminebt opened 3 weeks ago

aminebt commented 3 weeks ago

This is a question about whether the behavior we observe in k8s is the expected one, or whether there is a misconfiguration on our side.

Observed behavior: while replicas are down, all postgres clients report errors connecting to postgres (through pgpool), despite the leader postgres pod being healthy.
Expected behavior: given that the postgres leader is still up and running, I was expecting pgpool to accept client connections and use the remaining healthy backend.

How to reproduce: the situation can be simulated on any similar k8s set-up by tampering with the replica service selector so that it no longer has any endpoints (this is equivalent to the replica pods being unhealthy and therefore excluded from the replica service); a sketch follows.
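A minimal sketch of that reproduction step, assuming the replica Service is the postgres-db-pg-cluster-repl name that appears as backend 1 in the pool_nodes output below (the actual Service name and labels depend on the deployment):

```sh
# Point the replica Service selector at a label no pod carries, so the Service
# loses all its endpoints (mimicking all replica pods becoming unhealthy).
kubectl patch service postgres-db-pg-cluster-repl \
  --type merge \
  -p '{"spec":{"selector":{"app":"no-such-pod"}}}'

# Confirm the Service now has no endpoints.
kubectl get endpoints postgres-db-pg-cluster-repl
```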

Remark: the expected behavior can be achieved if I set these two parameters to the values below:

backend_flag1 = 'ALLOW_TO_FAILOVER'
failover_on_backend_error = 'on'

But in that case, when the replicas return to a healthy state, the corresponding backend is not reattached, even with auto_failback set to on (though that is perhaps a different story/question about sr_check, application_name, etc.). A sketch of the relevant configuration follows.
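For context, a quick way to inspect those parameters inside the pgpool container; the config path is an assumption based on the official pgpool container image, and the values are simply the ones tried above, not a recommendation:

```sh
# Show the relevant parameters from pgpool.conf (adjust the path to your mount point).
grep -E '^(backend_flag1|failover_on_backend_error|auto_failback|sr_check_period)' \
  /opt/pgpool-II/etc/pgpool.conf
# Lines expected for the experiment described above:
#   backend_flag1             = 'ALLOW_TO_FAILOVER'
#   failover_on_backend_error = 'on'
#   auto_failback             = 'on'
```

As far as I understand, auto_failback only reattaches a standby once the streaming replication check (sr_check) sees it replicating again, which is why the sr_check / application_name point above matters.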

I'd appreciate your help and clarification on this.

pengbo0328 commented 3 weeks ago

Observed behavior: while replicas are down, all postgres clients report errors connecting to postgres (through pgpool), despite the leader postgres pod being healthy.

In this case Pgpool-II should send all queries to the primary.

Could you show your pgpool settings and logs?

aminebt commented 3 weeks ago

Attached:

From what I can see, after we lose the replicas we start repeatedly getting these messages:

ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"

The same behavior and log messages above are seen when setting:

sr_check_period = 0
failover_on_backend_shutdown = 'off' (for some reason I had it set to 'on' by default)
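For completeness, the values pgpool is actually running with can be checked over a normal client connection while pgpool is still accepting them; a minimal sketch, assuming the default port used in the transcripts below (the parameter names are real pgpool.conf settings):

```sh
# PGPOOL SHOW reports the value of a pgpool.conf parameter over a regular client
# connection, so it only works while pgpool is accepting connections.
psql -h localhost -p 5432 -c "PGPOOL SHOW sr_check_period;"
psql -h localhost -p 5432 -c "PGPOOL SHOW failover_on_backend_shutdown;"
psql -h localhost -p 5432 -c "PGPOOL SHOW failover_on_backend_error;"
```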

In the pgpool container, when the replicas are healthy:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "show pool_nodes;"
 node_id | hostname                    | port | status | pg_status | lb_weight | role    | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-----------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | postgres-db-pg-cluster      | 5432 | up     | unknown   | 0.500000  | primary | unknown | 78         | true              | 0                 |                   |                        | 2024-08-21 06:55:45
 1       | postgres-db-pg-cluster-repl | 5432 | up     | unknown   | 0.500000  | standby | unknown | 65         | false             | 0                 |                   |                        | 2024-08-21 06:55:45
(2 rows)

In the pgpool container, after the replicas are down:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "show pool_nodes;"
psql: error: connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"

And I can't even check the config at that point:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "pgpool show healthcheck;"
psql: error: connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
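When psql through pgpool fails like this, the PCP interface can still be asked for backend status, since it does not go through the client query path; a hedged sketch, assuming PCP is configured on its default port 9898 with a PCP user named pgpool and a .pcppass file in place (all of which are assumptions about this deployment):

```sh
# Query pgpool over PCP (not over the SQL path) for the status of backend node 1.
# -w skips the password prompt and relies on ~/.pcppass; the node id is positional.
pcp_node_info -h localhost -p 9898 -U pgpool -w 1
```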

pgpool-loosing-replicas.log pgpool_show_all.csv pgpool_conf.txt

pengbo0328 commented 2 weeks ago

I actually couldn't reproduce it. I think this may occur when a pooled connection was disconnected but Pgpool-II did not detect it. Does the same issue occur if you set connection_cache = 'false'?
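A sketch of one way to try that suggestion on k8s, assuming the pgpool Deployment is named pgpool and uses the official pgpool/pgpool image, whose entrypoint maps PGPOOL_PARAMS_* environment variables onto pgpool.conf parameters (the Deployment name and the env-var mechanism are assumptions; adjust to the actual manifests):

```sh
# Disable pgpool's connection cache via the image's env-var override mechanism.
# Changing the pod template env triggers a rolling restart of the Deployment,
# after which the lost-replica scenario can be replayed.
kubectl set env deployment/pgpool PGPOOL_PARAMS_CONNECTION_CACHE=off
kubectl rollout status deployment/pgpool
```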