Error when disconnecting the network on postgres and pgpool standby node

hernanpezzani commented 1 week ago

The postgres and pgpool configuration is the following one: version postgresql 14.12 and pgpool 4.4.7 Node 0 is primary postgresql server (backend 0) and standby pgpool_node, Node 1 is the standby postgresql (backend 1) and standby pgpool_node Node 2 is LEADER of pgpool with delegate_IP UP and it does not have postgresql. WatchDog is enabled with hearthbeat

Postgres works perfectly, reads, writes and works with pgpool connecting from port 9999.

Detail of the failure: When the network of the Node 1 is TURNED DOWN (VMWARE disconnect network), the cluster pgpool delays connection to the database through the delegate_IP and port 9999. We have checked it and it takes 1-2 min to ask for the password. psql -h delegate_IP -p 9999 -U prueba -d prueba after 1-2 min prompt password

while the network is DOWN on node 1, direct access to node 0 postgresql through port 5432 is correct, the replication between the nodes becomes asynchronous and postgres operation is as expected, but access through IP delegated by port 9999 is still delayed

When the network is TURNED ON again, the pgpool service continues delaying connections until the service gets restarted on Node 2 (PGPOOL LEADER).

PGPOOL STATE NETWORK DOWN 3 3 YES nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com nodo0.dominio.com:9999 Linux nodo0.dominio.com nodo0.dominio.com 9999 90000 7 STANDBY 0 MEMBER nodo1.dominio.com:9999 Linux nodo1.dominio.com nodo1.dominio.com 9999 90000 8 LOST 0 MEMBER nodo2.dominio.com:9999 Linux nodo2.dominio.com nodo2.dominio.com 9999 90000 4 LEADER 0 MEMBER

config pgpool delegate_IP = '172.27.37.56' if_cmd_path = '/sbin' if_up_cmd = '/bin/sudo /sbin/ip addr add $IP$/24 dev ens192 label ens192:0' if_down_cmd = '/bin/sudo /sbin/ip addr del $IP$/24 dev ens192' arping_path = '/usr/sbin' arping_cmd = '/bin/sudo /sbin/arping -U $IP$ -w 1 -I ens192'

load_balance_mode = off connection_cache = on

backend_hostname0 = 'nodo0.dominio.com' backend_port0 = 5432 backend_data_directory0 = '/postgresql/14/data/mdona-cpdm-pre-cluster07' backend_weight0 = 1 backend_application_name0 = 'nodo0' backend_flag0 = 'ALLOW_TO_FAILOVER' backend_hostname1 = 'nodo1.dominio.com' backend_port1 = 5432 backend_data_directory1 = '/postgresql/14/data/mdona-cpdm-pre-cluster07' backend_weight1 = 1 backend_application_name1 = 'nodo1' backend_flag1 = 'ALLOW_TO_FAILOVER'

use_watchdog = on wd_escalation_command = '/etc/pgpool-II/escalation.sh' hostname0 = 'nodo0.dominio.com' pgpool_port0 = 9999 wd_port0 = 9000 heartbeat_hostname0 = 'nodo0.dominio.com' heartbeat_port0 = 9694 hostname1 = 'nodo1.dominio.com' pgpool_port1 = 9999 wd_port1 = 9000 heartbeat_hostname1 = 'nodo1.dominio.com heartbeat_port1 = 9694 hostname2 = 'nodo2.dominio.com' pgpool_port2 = 9999 wd_port2 = 9000 heartbeat_hostname2 = 'nodo2.dominio.com' heartbeat_port2 = 9694

process_management_mode = dynamic process_management_strategy = gentle min_spare_children = 10 max_spare_children = 20

health_check_period = 5 health_check_timeout = 20 health_check_user = 'pgpool2' health_check_password = '' health_check_database = '' health_check_max_retries = 0 health_check_retry_delay = 1 connect_timeout = 10000

num_init_children = 500 max_pool = 2

child_life_time = 120 child_max_connections = 0 connection_life_time = 120 client_idle_limit = 0

backend_clustering_mode = 'streaming_replication' listen_addresses = '*' port = 9999 socket_dir = '/tmp'

Hernan Pezzani

pengbo0328 commented 4 days ago

@hernanpezzani Thank you for reporting this issue.

I tried to reproduce this issue, but I couldn't reproduce it.

Detail of the failure: When the network of the Node 1 is TURNED DOWN (VMWARE disconnect network), the cluster pgpool delays connection to the database through the delegate_IP and port 9999. We have checked it and it takes 1-2 min to ask for the password. psql -h delegate_IP -p 9999 -U prueba -d prueba after 1-2 min prompt password

Node1 is the standby pgpool and standby postgres. Does this issue only occur when you TURN DOWN the network on Node1? If you TURN DOWN the network on other nodes, the same issue occurs?

I could not find the configuration of failover scripts. Does pgpool manage failover of PostgreSQL?

Could you enable the debug log and reproduce this issue again?

log_min_messages = debug5

Please share:

pgpool.conf
the debug log
time when you turned down the network
time when you ran psql command

hernanpezzani commented 3 days ago

Hi @pengbo0328

Node1 is the standby pgpool and standby postgres. Yes

Does this issue only occur when you TURN DOWN the network on Node1? If you TURN DOWN the network on other nodes, the same issue occurs?

We have not simulated it because we focused on seeing why it failed when both nodes were standby (postgres and pgpool). The network loss simulation is done by disconnecting the virtual machine network (uncheck connected), when the services are stopped or the node is restarted or stopped the behavior is as expected.

I could not find the configuration of failover scripts. Does pgpool manage failover of PostgreSQL? Yes

Could you enable the debug log and reproduce this issue again? Yes

time when you turned down the network: "5 -10 min approximately"

time when you ran psql command: "every 1 sec we try to connect from an external device"

We are preparing (waiting for the OK from security) the file with the scripts.sh, .conf and only the logs with debug5 of the primary pgpool node (node2), the file size is about 50Mb, we attach it to the issue?

Do you need the logs of all 3 nodes?

hernanpezzani commented 3 days ago

In the log with debug5 enabled, the lines with the format of the example, which is what is after "->" (marked in bold)

example: `postgresSELECT count() FROM pg_catalog.pg_proc AS p, pg_catalog.pg_namespace AS n WHERE p.proname = 'pg_read_file' AND n.oid = p.pronamespace AND n.nspname ~ '.' AND p.provolatile = 'v'postgres' -> '5564261919auUa8uyt9ceceb0626896e'.

Is the user's password?

tatsuo-ishii commented 3 days ago

Is the user's password?

No.

hernanpezzani commented 3 days ago

I cannot upload files larger than 25 mb. If I split it with split -b 20M file.tar.gz "file.part" it does not allow me to upload the extensions partaa, etc. 2020-08-26 13_31_23-Error del Servidor de Contenidos 2020-08-26 16_21_38-Error del Servidor de Contenidos

pengbo0328 commented 2 days ago

@hernanpezzani

We have not simulated it because we focused on seeing why it failed when both nodes were standby (postgres and pgpool). The network loss simulation is done by disconnecting the virtual machine network (uncheck connected), when the services are stopped or the node is restarted or stopped the behavior is as expected.

I tried the same way to disconnect the netwrok (uncheck connected on VMWare settings). Then the pgpool node is in LOST status, and standby PostgreSQL is in down status.

I cannot upload files larger than 25 mb. If I split it with split -b 20M file.tar.gz "file.part" it does not allow me to upload the extensions partaa, etc.

Thank you for preparing the debug log. OK. Because it is in a large size, could you share the default logs with "log_min_messages = warning"? I will try to check whether the cause can be identified from the default logs.

If you could share the pgpool.conf and the scripts (failover and escalation scripts), I will try to reproduce this issue and enable debug logs by myself.

hernanpezzani commented 2 days ago

issue55

please rename the file to zip

hernanpezzani commented 2 days ago

https://drive.google.com/file/d/1Dso7q_hM6NlT2_7Ybjyd3nH0-nWyuY1t/view?usp=drive_link

link to drive.google

pgpool / pgpool2

Error when disconnecting the network on postgres and pgpool standby node #55