pgpool / pgpool2_on_k8s


what happens when temporarily losing replicas #30

Open aminebt opened 3 weeks ago

aminebt commented 3 weeks ago

This is a question about whether the behavior we observe in k8s is the expected one, or whether there is a misconfiguration on our side.

Observed behavior: while replicas are down, all postgres clients report errors connecting to postgres (through pgpool), despite the leader postgres pod being healthy.
Expected behavior: given that the postgres leader is still up and running, I was expecting pgpool to accept client connections and use the remaining healthy backend.

How to reproduce: the situation can be simulated on any similar k8s set-up by tampering with the replica service selector so that it no longer has any endpoints (this is equivalent to the replica pods being unhealthy and therefore excluded from the replica service); a sketch follows.
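A minimal sketch of that reproduction step, assuming the replica Service is the postgres-db-pg-cluster-repl name that appears as backend 1 in the pool_nodes output below (the actual Service name and labels depend on the deployment):

```sh
# Point the replica Service selector at a label no pod carries, so the Service
# loses all its endpoints (mimicking all replica pods becoming unhealthy).
kubectl patch service postgres-db-pg-cluster-repl \
  --type merge \
  -p '{"spec":{"selector":{"app":"no-such-pod"}}}'

# Confirm the Service now has no endpoints.
kubectl get endpoints postgres-db-pg-cluster-repl
```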

Remark: the expected behavior can be achieved if I set these two parameters to the values below:

backend_flag1 = 'ALLOW_TO_FAILOVER'
failover_on_backend_error = 'on'

But in that case, when the replicas return to a healthy state, the corresponding backend is not reattached, even with auto_failback set to on (though that is perhaps a different story/question about sr_check, application_name, etc.). A sketch of the relevant configuration follows.
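For context, a quick way to inspect those parameters inside the pgpool container; the config path is an assumption based on the official pgpool container image, and the values are simply the ones tried above, not a recommendation:

```sh
# Show the relevant parameters from pgpool.conf (adjust the path to your mount point).
grep -E '^(backend_flag1|failover_on_backend_error|auto_failback|sr_check_period)' \
  /opt/pgpool-II/etc/pgpool.conf
# Lines expected for the experiment described above:
#   backend_flag1             = 'ALLOW_TO_FAILOVER'
#   failover_on_backend_error = 'on'
#   auto_failback             = 'on'
```

As far as I understand, auto_failback only reattaches a standby once the streaming replication check (sr_check) sees it replicating again, which is why the sr_check / application_name point above matters.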

I'd appreciate your help and clarification on this.

pengbo0328 commented 3 weeks ago

Observed behavior: while replicas are down, all postgres clients report errors connecting to postgres (through pgpool), despite the leader postgres pod being healthy.

In this case Pgpool-II should send all queries to the primary.

Could you show your pgpool settings and logs?

aminebt commented 3 weeks ago

Attached:

From what I can see, after we lose the replicas we start repeatedly getting these messages:

ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"

The same behavior and log messages above are seen when setting:

sr_check_period = 0
failover_on_backend_shutdown = 'off' (for some reason I had it set to 'on' by default)
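For completeness, the values pgpool is actually running with can be checked over a normal client connection while pgpool is still accepting them; a minimal sketch, assuming the default port used in the transcripts below (the parameter names are real pgpool.conf settings):

```sh
# PGPOOL SHOW reports the value of a pgpool.conf parameter over a regular client
# connection, so it only works while pgpool is accepting connections.
psql -h localhost -p 5432 -c "PGPOOL SHOW sr_check_period;"
psql -h localhost -p 5432 -c "PGPOOL SHOW failover_on_backend_shutdown;"
psql -h localhost -p 5432 -c "PGPOOL SHOW failover_on_backend_error;"
```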

In the pgpool container, when the replicas are healthy:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "show pool_nodes;"
 node_id | hostname                    | port | status | pg_status | lb_weight | role    | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-----------------------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | postgres-db-pg-cluster      | 5432 | up     | unknown   | 0.500000  | primary | unknown | 78         | true              | 0                 |                   |                        | 2024-08-21 06:55:45
 1       | postgres-db-pg-cluster-repl | 5432 | up     | unknown   | 0.500000  | standby | unknown | 65         | false             | 0                 |                   |                        | 2024-08-21 06:55:45
(2 rows)

In the pgpool container, after the replicas are down:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "show pool_nodes;"
psql: error: connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"

And I can't even check the config at that point:

postgres-db-pg-cluster-pgpool-547d6cfff7-bgm5n:/$ psql -h localhost -c "pgpool show healthcheck;"
psql: error: connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
connection to server at "localhost" (::1), port 5432 failed: ERROR: unable to read data from DB node 1
DETAIL: socket read failed with error "Connection reset by peer"
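When psql through pgpool fails like this, the PCP interface can still be asked for backend status, since it does not go through the client query path; a hedged sketch, assuming PCP is configured on its default port 9898 with a PCP user named pgpool and a .pcppass file in place (all of which are assumptions about this deployment):

```sh
# Query pgpool over PCP (not over the SQL path) for the status of backend node 1.
# -w skips the password prompt and relies on ~/.pcppass; the node id is positional.
pcp_node_info -h localhost -p 9898 -U pgpool -w 1
```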

pgpool-loosing-replicas.log pgpool_show_all.csv pgpool_conf.txt

pengbo0328 commented 2 weeks ago

I actually couldn't reproduce it. I think this may occur when a pooled connection was disconnected but Pgpool-II did not detect it. Does the same issue occur if you set connection_cache = 'false'?
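A sketch of one way to try that suggestion on k8s, assuming the pgpool Deployment is named pgpool and uses the official pgpool/pgpool image, whose entrypoint maps PGPOOL_PARAMS_* environment variables onto pgpool.conf parameters (the Deployment name and the env-var mechanism are assumptions; adjust to the actual manifests):

```sh
# Disable pgpool's connection cache via the image's env-var override mechanism.
# Changing the pod template env triggers a rolling restart of the Deployment,
# after which the lost-replica scenario can be replayed.
kubectl set env deployment/pgpool PGPOOL_PARAMS_CONNECTION_CACHE=off
kubectl rollout status deployment/pgpool
```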