[BUG] CoreDNS requires restart after scaling up nodes to be able to resolve new hostnames

Describe the bug CoreDNS cannot resolve the hostnames of nodes that were added to the cluster by scaling it up, that is, by re-applying the configuration with an increased number of nodes. Encountered while testing PostgreSQL and PgPool.

How to reproduce Steps to reproduce the behavior:

1. Deploy a cluster with the following components enabled: Kubernetes master, Kubernetes node and 1 PostgreSQL VM.
2. Increase the number of PostgreSQL nodes to 2, enable replication, enable PgPool application deployment and re-apply the configuration.

Expected behavior PgPool is able to connect to 2 PostgreSQL nodes.

Environment AWS/Ubuntu but most likely all configurations are affected.

epicli version: develop, but most likely all previous versions are affected.

Additional context PgPool pod logs:

2021-05-26 11:00:02: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed

pgpool 11:00:02.51 INFO  ==> ** Starting Pgpool-II **
2021-05-26 11:00:02: pid 1: LOG:  Backend status file /opt/bitnami/pgpool/logs/pgpool_status does not exist
2021-05-26 11:00:02: pid 1: LOG:  memory cache initialized
2021-05-26 11:00:02: pid 1: DETAIL:  memcache blocks :64
2021-05-26 11:00:02: pid 1: LOG:  pool_discard_oid_maps: discarded memqcache oid maps
2021-05-26 11:00:02: pid 1: LOG:  Setting up socket for 0.0.0.0:5432
2021-05-26 11:00:02: pid 1: LOG:  Setting up socket for :::5432
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
2021-05-26 11:00:02: pid 1: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 1: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 1: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node: make_persistent_db_connection_noerror failed on node 1
2021-05-26 11:00:02: pid 1: LOG:  find_primary_node: primary node is 0
2021-05-26 11:00:02: pid 123: LOG:  PCP process: 123 started
2021-05-26 11:00:02: pid 1: LOG:  pgpool-II successfully started. version 4.1.1 (karasukiboshi)
2021-05-26 11:00:02: pid 1: LOG:  node status[0]: 1
2021-05-26 11:00:02: pid 1: LOG:  node status[1]: 0
2021-05-26 11:00:02: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:02: pid 126: LOG:  health check retrying on DB node: 1 (round:1)
2021-05-26 11:00:02: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:02: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:02: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:07: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:07: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:07: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:07: pid 126: LOG:  health check retrying on DB node: 1 (round:2)
2021-05-26 11:00:11: pid 117: LOG:  md5 authentication successful with frontend
2021-05-26 11:00:12: pid 117: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 117: LOG:  failed to create a backend 1 connection
2021-05-26 11:00:12: pid 117: DETAIL:  skip this backend because because failover_on_backend_error is off and we are in streaming replication mode and node is standby node
2021-05-26 11:00:12: pid 117: LOG:  pool_reuse_block: blockid: 0
2021-05-26 11:00:12: pid 117: CONTEXT:  while searching system catalog, When relcache is missed
2021-05-26 11:00:12: pid 117: LOG:  failover or failback event detected
2021-05-26 11:00:12: pid 117: DETAIL:  restarting myself
2021-05-26 11:00:12: pid 1: LOG:  child process with pid: 117 exits with status 256
2021-05-26 11:00:12: pid 1: LOG:  fork a new child process with pid: 134
2021-05-26 11:00:12: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:12: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:12: pid 126: LOG:  health check retrying on DB node: 1 (round:3)
2021-05-26 11:00:12: pid 124: ERROR:  Failed to check replication time lag
2021-05-26 11:00:12: pid 124: DETAIL:  No persistent db connection for the node 1
2021-05-26 11:00:12: pid 124: HINT:  check sr_check_user and sr_check_password
2021-05-26 11:00:12: pid 124: CONTEXT:  while checking replication time lag
2021-05-26 11:00:12: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:12: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:12: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:17: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:17: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:17: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:17: pid 126: LOG:  health check retrying on DB node: 1 (round:4)
2021-05-26 11:00:21: pid 91: LOG:  md5 authentication successful with frontend
2021-05-26 11:00:21: pid 91: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:21: pid 91: LOG:  failed to create a backend 1 connection
2021-05-26 11:00:21: pid 91: DETAIL:  skip this backend because because failover_on_backend_error is off and we are in streaming replication mode and node is standby node
2021-05-26 11:00:21: pid 91: LOG:  failover or failback event detected
2021-05-26 11:00:21: pid 91: DETAIL:  restarting myself
2021-05-26 11:00:21: pid 1: LOG:  child process with pid: 91 exits with status 256
2021-05-26 11:00:21: pid 1: LOG:  fork a new child process with pid: 143
2021-05-26 11:00:22: pid 123: LOG:  forked new pcp worker, pid=150 socket=7
2021-05-26 11:00:22: pid 123: LOG:  PCP process with pid: 150 exit with SUCCESS.
2021-05-26 11:00:22: pid 123: LOG:  PCP process with pid: 150 exits with status 0
2021-05-26 11:00:22: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:22: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:22: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:22: pid 126: LOG:  health check retrying on DB node: 1 (round:5)
2021-05-26 11:00:22: pid 124: ERROR:  Failed to check replication time lag
2021-05-26 11:00:22: pid 124: DETAIL:  No persistent db connection for the node 1
2021-05-26 11:00:22: pid 124: HINT:  check sr_check_user and sr_check_password
2021-05-26 11:00:22: pid 124: CONTEXT:  while checking replication time lag
2021-05-26 11:00:22: pid 124: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:22: pid 124: ERROR:  failed to make persistent db connection
2021-05-26 11:00:22: pid 124: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:27: pid 126: WARNING:  failed to connect to PostgreSQL server, getaddrinfo() failed with error "Name or service not known"
2021-05-26 11:00:27: pid 126: ERROR:  failed to make persistent db connection
2021-05-26 11:00:27: pid 126: DETAIL:  connection to host:"ec2-1-1-1-1:5432" failed
2021-05-26 11:00:27: pid 126: LOG:  health check failed on node 1 (timeout:0)
2021-05-26 11:00:27: pid 126: LOG:  received degenerate backend request for node_id: 1 from pid [126]
2021-05-26 11:00:27: pid 1: LOG:  Pgpool-II parent process has received failover request
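
A long log like the one above can be summarized with a small filter that counts the failed connections per backend. This is only a sketch: the function name `failed_hosts` is hypothetical, and it assumes the `DETAIL:  connection to host:"..." failed` line format shown in these logs.

```shell
# Read Pgpool-II log lines on stdin and print "<host:port> <count>" for each
# distinct backend that appears in a 'connection to host:"..." failed' line.
# The line format is assumed from the log excerpt in this issue.
failed_hosts() {
  sed -n 's/.*connection to host:"\([^"]*\)" failed.*/\1/p' |
    sort |
    uniq -c |
    awk '{ print $2, $1 }'
}
```

It could be fed from the pod logs, for example `kubectl logs --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy | failed_hosts`.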

PgPool test commands showing that the node is down:

[ubuntu@ec2-x-x-x-x ~]$ kubectl exec --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy -- bash -c 'pcp_node_info -h localhost -U $PGPOOL_ADMIN_USERNAME -w --node-id=1 --verbose'
Hostname               : ec2-1-1-1-1
Port                   : 5432
Status                 : 3
Weight                 : 0.500000
Status Name            : quarantine
Role                   : standby
Replication Delay      : 0
Replication State      : streaming
Replication Sync State : async
Last Status Change     : 2021-05-26 11:00:27

[ubuntu@ec2-x-x-x-x ~]$ kubectl exec --namespace=postgres-pool pgpool-xxxxxxxx-yyyyy -- bash -c 'export PGPASSWORD=$(cat /opt/bitnami/pgpool/secrets/pgpool_sr_check_password) && psql -qAtX -h localhost -U $PGPOOL_SR_CHECK_USER -d postgres -c "show pool_nodes"'
0|ec2-y-y-y-y|5432|up|0.500000|primary|533|true|0|||2021-05-26 11:00:11
1|ec2-1-1-1-1|5432|down|0.500000|standby|0|false|0|streaming|async|2021-05-26 11:00:27
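
To pick out the affected backends from the `show pool_nodes` output without reading it by eye, the rows can be filtered on the status column. A minimal sketch, assuming the pipe-separated layout shown above (node id in column 1, hostname in column 2, status in column 4); `down_nodes` is a hypothetical helper name.

```shell
# Print "<node_id> <hostname>" for every backend that Pgpool-II reports as down.
# Input: pipe-separated rows as produced by psql -qAtX ... -c "show pool_nodes".
# Column positions are assumed from the output in this issue.
down_nodes() {
  awk -F'|' '$4 == "down" { print $1, $2 }'
}
```

Piping the `psql ... -c "show pool_nodes"` command above into `down_nodes` should then list only node 1 in this scenario.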

root@pgpool-xxxxxxxx-yyyyy:/# psql -h ec2-x-x-x-x -U epi_pgpool_sr_check
psql: could not translate host name "ec2-x-x-x-x" to address: Name or service not known

After restarting the CoreDNS deployment, name resolution started working properly.

kubectl rollout restart deployment coredns -n kube-system

Any Kubernetes deployment that refers to nodes by hostname may be affected after the components are scaled up: applications are unable to connect to the new nodes.
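
The workaround can be wrapped into one step that restarts CoreDNS, waits for the rollout to finish and then checks resolution from inside the PgPool pod. This is a sketch, not the project's fix: the function name and the `KUBECTL` override are hypothetical, the 120s timeout is arbitrary, and it assumes `getent` is available in the pod image.

```shell
# Restart CoreDNS, wait for the rollout, then try to resolve a hostname
# from inside a pod. Arguments: <pod name> <hostname to resolve>.
# KUBECTL may be overridden (e.g. KUBECTL=echo for a dry run).
restart_coredns_and_check() {
  pod="$1"
  host="$2"
  "${KUBECTL:-kubectl}" rollout restart deployment coredns -n kube-system &&
    "${KUBECTL:-kubectl}" rollout status deployment coredns -n kube-system --timeout=120s &&
    # Assumption: getent exists in the container image.
    "${KUBECTL:-kubectl}" exec --namespace=postgres-pool "$pod" -- getent hosts "$host"
}
```

For example, `restart_coredns_and_check pgpool-xxxxxxxx-yyyyy ec2-1-1-1-1` should exit non-zero if the hostname still cannot be resolved after the restart.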

DoD checklist

- [x] Changelog updated (if affected version was released)
- [x] COMPONENTS.md updated / doesn't need to be updated
- [x] Automated tests passed (QA pipelines)
  - [x] apply
  - [ ] upgrade
- [ ] Case covered by automated test (if possible)
- [ ] Idempotency tested
- [x] Documentation updated / doesn't need to be updated
- [x] All conversations in PR resolved
- [x] Backport tasks created / doesn't need to be backported

hitachienergy / epiphany

[BUG] CoreDNS requires restart after scaling up nodes to be able to resolve new hostnames #2345