hitachienergy / epiphany

Cloud and on-premises automation for Kubernetes centered industrial grade solutions.
Apache License 2.0

[BUG] CoreDNS requires restart after scaling up nodes to be able to resolve new hostnames #2345

Closed przemyslavic closed 3 years ago

przemyslavic commented 3 years ago

Describe the bug CoreDNS cannot resolve the hostnames of nodes that were added to the cluster by scaling it up, that is, by re-applying the configuration with an increased number of nodes. Encountered while testing PostgreSQL and pgPool.

How to reproduce Steps to reproduce the behavior:

  1. Deploy a cluster with the following components enabled: kubernetes master, kubernetes node and 1 postgresql vm.
  2. Increase the number of PostgreSQL nodes to 2, enable replication, enable the pgPool application deployment and re-apply the configuration.

Expected behavior PgPool is able to connect to both PostgreSQL nodes.

Environment AWS/Ubuntu but most likely all configurations are affected.

epicli version: develop, but most likely all previous versions are affected.

Additional context PgPool pod logs:

2021-05-26 11:00:02: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
pgpool 11:00:02.51 INFO  ==> ** Starting Pgpool-II **
2021-05-26 11:00:02: pid 1: LOG:  Backend status file /opt/bitnami/pgpool/logs/pgpool_status does not exist
2021-05-26 11:00:02: pid 1: LOG:  memory cache initialized
2021-05-26 11:00:02: pid 1: DETAIL:  memcache blocks :64
2021-05-26 11:00:02: pid 1: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2021-05-26 11:00:02: pid 1: LOG:  Setting up socket for 0.0.0.0:5432
2021-05-26 11:00:02: pid 1: LOG:  Setting up socket for :::5432
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-05-26 11:00:02: pid 1: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 1: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 1: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node: make_persistent_db_connection_noerror failed on node 1
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node: primary node is 0
2021-05-26 11:00:02: pid 123: LOG:  PCP process: 123 started
2021-05-26 11:00:02: pid 1: LOG:  pgpool-II successfully started. version 4.1.1 (karasukiboshi)
2021-05-26 11:00:02: pid 1: LOG:  node status[0]: 1
2021-05-26 11:00:02: pid 1: LOG:  node status[1]: 0
2021-05-26 11:00:02: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:02: pid 126: LOG:  health check retrying on DB node: 1 (round:1)
2021-05-26 11:00:02: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:07: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:07: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:07: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:07: pid 126: LOG:  health check retrying on DB node: 1 (round:2)
2021-05-26 11:00:11: pid 117: LOG:  md5 authentication successful with frontend
2021-05-26 11:00:12: pid 117: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 117: LOG:  failed to create a backend 1 connection
2021-05-26 11:00:12: pid 117: DETAIL:  skip this backend because because failover_on_backend_error is off and we are in streaming replication mode and node is standby node
2021-05-26 11:00:12: pid 117: LOG:  pool_reuse_block: blockid: 0
2021-05-26 11:00:12: pid 117: CONTEXT:  while searching system catalog, When relcache is missed
2021-05-26 11:00:12: pid 117: LOG:  failover or failback event detected
2021-05-26 11:00:12: pid 117: DETAIL:  restarting myself
2021-05-26 11:00:12: pid 1: LOG:  child process with pid: 117 exits with status 256
2021-05-26 11:00:12: pid 1: LOG:  fork a new child process with pid: 134
2021-05-26 11:00:12: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:12: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:12: pid 126: LOG:  health check retrying on DB node: 1 (round:3)
2021-05-26 11:00:12: pid 124: ERROR:  Failed to check replication time lag
2021-05-26 11:00:12: pid 124: DETAIL:  No persistent db connection for the node 1
2021-05-26 11:00:12: pid 124: HINT:  check sr_check_user and sr_check_password
2021-05-26 11:00:12: pid 124: CONTEXT:  while checking replication time lag
2021-05-26 11:00:12: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:12: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:17: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:17: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:17: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:17: pid 126: LOG:  health check retrying on DB node: 1 (round:4)
2021-05-26 11:00:21: pid 91: LOG:  md5 authentication successful with frontend
2021-05-26 11:00:21: pid 91: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:21: pid 91: LOG:  failed to create a backend 1 connection
2021-05-26 11:00:21: pid 91: DETAIL:  skip this backend because because failover_on_backend_error is off and we are in streaming replication mode and node is standby node
2021-05-26 11:00:21: pid 91: LOG:  failover or failback event detected
2021-05-26 11:00:21: pid 91: DETAIL:  restarting myself
2021-05-26 11:00:21: pid 1: LOG:  child process with pid: 91 exits with status 256
2021-05-26 11:00:21: pid 1: LOG:  fork a new child process with pid: 143
2021-05-26 11:00:22: pid 123: LOG:  forked new pcp worker, pid=150 socket=7
2021-05-26 11:00:22: pid 123: LOG:  PCP process with pid: 150 exit with SUCCESS.
2021-05-26 11:00:22: pid 123: LOG:  PCP process with pid: 150 exits with status 0
2021-05-26 11:00:22: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:22: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:22: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:22: pid 126: LOG:  health check retrying on DB node: 1 (round:5)
2021-05-26 11:00:22: pid 124: ERROR:  Failed to check replication time lag
2021-05-26 11:00:22: pid 124: DETAIL:  No persistent db connection for the node 1
2021-05-26 11:00:22: pid 124: HINT:  check sr_check_user and sr_check_password
2021-05-26 11:00:22: pid 124: CONTEXT:  while checking replication time lag
2021-05-26 11:00:22: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:22: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:22: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:27: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:27: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:27: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:27: pid 126: LOG:  health check failed on node 1 (timeout:0)
2021-05-26 11:00:27: pid 126: LOG:  received degenerate backend request for node_id: 1 from pid [126]
2021-05-26 11:00:27: pid 1: LOG:  Pgpool-II parent process has received failover request
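To see at a glance which backends pgpool failed to reach, the log above can be filtered for its `DETAIL: connection to host:"..." failed` lines. The following is a minimal sketch that assumes the log format shown above; the function name `failed_backends` is hypothetical, and it reports every failed connection, not only those caused by name resolution.

```shell
# Hypothetical helper: read pgpool log lines on stdin and print each distinct
# host:port found in a 'connection to host:"..." failed' DETAIL line.
failed_backends() {
  sed -n 's/.*connection to host:"\([^"]*\)" failed.*/\1/p' | sort -u
}
```

It could be fed from the pod log, for example `kubectl logs --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy | failed_backends`.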

PgPool test commands showing that the node is down:

[ubuntu@ec2-x-x-x-x ~]$ kubectl exec --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy -- bash -c 'pcp_node_info -h localhost -U $PGPOOL_ADMIN_USERNAME -w --node-id=1 --verbose'
Hostname               : ec2-1-1-1-1
Port                   : 5432
Status                 : 3
Weight                 : 0.500000
Status Name            : quarantine
Role                   : standby
Replication Delay      : 0
Replication State      : streaming
Replication Sync State : async
Last Status Change     : 2021-05-26 11:00:27

[ubuntu@ec2-x-x-x-x ~]$ kubectl exec --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy -- bash -c 'export PGPASSWORD=$(cat /opt/bitnami/pgpool/secrets/pgpool_sr_check_password) && psql -qAtX -h localhost -U $PGPOOL_SR_CHECK_USER -d postgres -c "show pool_nodes"'
0|ec2-y-y-y-y|5432|up|0.500000|primary|533|true|0|||2021-05-26 11:00:11
1|ec2-1-1-1-1|5432|down|0.500000|standby|0|false|0|streaming|async|2021-05-26 11:00:27
root@pgpool-xxxxxxxx-yyyyy:/# psql -h ec2-x-x-x-x -U epi_pgpool_sr_check
psql: could not translate host name "ec2-x-x-x-x" to address: Name or service not known
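The `show pool_nodes` rows above can also be checked mechanically. This is a small sketch that assumes the column order seen in the output above (node id, hostname, port, status, ...), which may differ in other pgpool versions; the function name `down_nodes` is hypothetical.

```shell
# Hypothetical helper: read `show pool_nodes` rows (unaligned, tuples-only psql
# output with '|' as the field separator) on stdin and print the hostname of
# every node whose status column (4th field) is "down".
down_nodes() {
  awk -F'|' '$4 == "down" { print $2 }'
}
```

Piping the `psql -qAtX ... -c "show pool_nodes"` command shown above into `down_nodes` would list only the backends that pgpool currently considers down.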

After restarting the CoreDNS deployment, name resolution started working properly.

kubectl rollout restart deployment coredns -n kube-system
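If the workaround is scripted, it may help to wait until the restarted CoreDNS pods are actually rolled out before re-testing. Below is a sketch of one way to do that; the wrapper name `restart_coredns`, the `KUBECTL` override variable and the 120-second timeout are assumptions made for illustration, not part of epicli.

```shell
# Hypothetical wrapper around the workaround: restart CoreDNS, then block until
# the rollout completes or the timeout expires. Set KUBECTL=echo to preview
# the commands without touching a cluster.
restart_coredns() {
  "${KUBECTL:-kubectl}" rollout restart deployment coredns -n kube-system &&
    "${KUBECTL:-kubectl}" rollout status deployment coredns -n kube-system --timeout=120s
}
```

Running `KUBECTL=echo restart_coredns` prints the two kubectl invocations instead of executing them, which is a cheap way to review the commands first.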

Any deployment on Kubernetes that refers to nodes by hostname may be affected after scaling up the components: applications are unable to connect to the new nodes.



plirglo commented 3 years ago

We may consider using the CoreDNS reload plugin.
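Before going down that route, it may be worth checking whether the `reload` plugin is already enabled in the running Corefile. The sketch below assumes the Corefile is stored under the `Corefile` key of the `coredns` ConfigMap in `kube-system` (the kubeadm default); the function name `corefile_has_reload` is hypothetical. The `reload` plugin only re-reads a changed Corefile, so whether it fixes this issue depends on how the new hostnames reach CoreDNS, which is not established here.

```shell
# Hypothetical helper: read a Corefile on stdin and succeed only if `reload`
# appears as a plugin directly inside a server block (brace depth 1). A
# `reload` option nested inside another plugin's block is deliberately ignored.
corefile_has_reload() {
  awk '
    depth == 1 && $1 == "reload" { found = 1 }
    { depth += gsub(/[{]/, "{"); depth -= gsub(/[}]/, "}") }
    END { exit !found }
  '
}
```

A possible invocation: `kubectl get configmap coredns -n kube-system -o jsonpath='{.data.Corefile}' | corefile_has_reload && echo "reload plugin enabled"`.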

przemyslavic commented 3 years ago

:heavy_check_mark: Tested automatically in the pipeline (scaling up) and additionally manually verified pgpool status and connection to PostgreSQL nodes.