vitabaks / postgresql_cluster

PostgreSQL High-Availability Cluster (based on "Patroni" and DCS "etcd" or "consul"). Automating with Ansible.
MIT License

Consul Endpoint connection fails- Name or service not known #254

Closed. oxycash closed this issue 1 year ago.

oxycash commented 1 year ago

We are getting the error below when trying to set up on OL7. We had to make some changes to the Ansible playbook to get things going, so we're not sure what we missed.

Can you help us understand how the Consul setup, Consul service registration, and the dnsmasq/iptables/netaddr combination are used?

bash-4.2$ consul catalog services
consul
postgres-cluster
bash-4.2$ psql -U postgres -h master.postgres-cluster.service.consul -p 5432
psql: error: could not translate host name "master.postgres-cluster.service.consul" to address: Name or service not known

sudo netstat -ap | grep 8600 returns nothing

vitabaks commented 1 year ago

We are getting the error below when trying to set up on OL7. We had to make some changes to the Ansible playbook to get things going, so we're not sure what we missed.

I need to know the details of the error and what changes have been made to try to help you with this issue.

Can you help us understand how the Consul setup, Consul service registration, and the dnsmasq/iptables/netaddr combination are used?

By default, DNS is served from port 53. On most operating systems, this requires elevated privileges. Rather than running Consul with an administrative or root account, we forward appropriate queries to Consul (running on an unprivileged port).

On the cluster nodes, we install and configure dnsmasq to forward DNS queries to the local Consul agent (if consul_dnsmasq_enable: true). Details: https://developer.hashicorp.com/consul/tutorials/networking/dns-forwarding
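
For reference, the forwarding rule that ends up in /etc/dnsmasq.d/ on the cluster nodes looks roughly like this (a minimal sketch; the file name and upstream addresses are examples and depend on your variables):

# /etc/dnsmasq.d/10-consul (sketch)
# forward all *.consul queries to the local Consul agent's DNS port
server=/consul/127.0.0.1#8600
# send everything else to the regular upstream resolvers (placeholder addresses)
server=8.8.8.8
server=9.9.9.9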

You can use this playbook to install and configure the Consul client + dnsmasq on the application servers, or you can take care of that yourself. To configure the application servers, simply add them to the inventory file and run the playbook.

Inventory (example):

[consul_instances]  # recommendation: 3 or 5-7 nodes
10.128.64.140 consul_node_role=server consul_bootstrap_expect=true
10.128.64.142 consul_node_role=server consul_bootstrap_expect=true
10.128.64.143 consul_node_role=server consul_bootstrap_expect=true
10.128.64.144 consul_node_role=client
10.128.64.145 consul_node_role=client

Note: In this example, .144 and .145 are the application servers on which we install Consul in client mode (consul_node_role=client).

Run the playbook:

ansible-playbook consul.yml
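
After the run, you can check that Consul DNS answers both directly and through dnsmasq (a quick sketch; the cluster name matches the one used in this thread):

# ask the Consul agent directly on its DNS port
dig @127.0.0.1 -p 8600 +short master.postgres-cluster.service.consul
# ask through dnsmasq / the system resolver
dig +short master.postgres-cluster.service.consul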

vitabaks commented 1 year ago

More details about how it works:

oxycash commented 1 year ago

We don't have systemd-resolved, so I made the changes below.

[dev@pgnode01 ~]$ sudo cat /etc/hosts
127.0.0.1 localhost pgnode01
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6

10.50.39.33           host_name
[dev@pgnode01 ~]$ host consul.service.consul
Host consul.service.consul not found: 3(NXDOMAIN)
[dev@pgnode01 ~]$ sudo ls /etc/systemd/
bootchart.conf  coredump.conf  journald.conf  logind.conf  pstore.conf  rhel-dmesg  system  system.conf  user  user.conf
[dev@pgnode01 ~]$ sudo vi /etc/resolv.conf
[dev@pgnode01 ~]$ sudo cat /etc/resolv.conf
search  subdomain.domain.com
nameserver      some_ip
nameserver      some_ip
nameserver      127.0.0.1
nameserver 127.0.0.1
[dev@pgnode01 ~]$ host consul.service.consul
Host consul.service.consul not found: 3(NXDOMAIN)
[dev@pgnode01 ~]$ sudo vi /etc/dnsmasq.conf
[dev@pgnode01 ~]$ sudo cat /etc/dnsmasq.d/10-consul
server=/consul/127.0.0.1#8600
server=8.8.8.8
server=9.9.9.9

Coming to the Ansible playbook, we completely skipped the Consul part. We got everything up and running except that the endpoint connectivity doesn't work at all. nslookup consul.service.consul goes to the internal DNS servers instead of 127.0.0.1. I don't have any recursors set in Consul.

oxycash commented 1 year ago

Just realized this:

nslookup master.postgres-cluster.service.consul 127.0.0.1 -port=8600
Server:         127.0.0.1
Address:        127.0.0.1#8600

Name:   master.postgres-cluster.service.consul
Address: current_master_ip

nslookup replica.postgrs-cluster.service.consul 127.0.0.1 -port=8600
Server:         127.0.0.1
Address:        127.0.0.1#8600

** server can't find replica.postgrs-cluster.service.consul.OURDOMAIN.com: REFUSED

systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN

 dig @127.0.0.1 -p 8600 postgres-cluster.service.consul

; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @127.0.0.1 -p 8600 postgres-cluster.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31501
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;postgres-cluster.service.consul. IN    A

;; ANSWER SECTION:
postgres-cluster.service.consul. 0 IN   A       master_node_ip

;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Feb 14 17:46:46 GMT 2023
;; MSG SIZE  rcvd: 76

vitabaks commented 1 year ago

I recommend leaving only nameserver 127.0.0.1 in resolv.conf, which will effectively mean using dnsmasq.

Then you can add the additional servers in the dnsmasq configuration by listing them in the consul_dnsmasq_servers variable.
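
A minimal sketch of the corresponding variables (the addresses are placeholders for your real upstream DNS servers):

# forward non-.consul queries to these upstreams via dnsmasq
consul_dnsmasq_enable: true
consul_dnsmasq_servers:
  - "8.8.8.8"
  - "9.9.9.9"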

vitabaks commented 1 year ago

Coming to the Ansible playbook, we completely skipped the Consul part.

I do not recommend making any manual changes; use the Ansible playbook.

There are quite a lot of things to consider if you want to create a really robust cluster. If you deploy your cluster manually, especially if you don't have experience with this yet, there is a very high risk of making a mistake in the configuration, and the cluster may not work as reliably as you expect.

oxycash commented 1 year ago

I recommend leaving only nameserver 127.0.0.1 in resolv.conf, which will effectively mean using dnsmasq.

Then you can add the additional servers in the dnsmasq configuration by listing them in the consul_dnsmasq_servers variable.

Thank you @vitabaks

Instead of removing them, I just reordered the nameserver list and things started working.

[dev@pgnode01 ~]$ sudo cat /etc/resolv.conf
search  subdomain.domain.com
nameserver      127.0.0.1
nameserver      some_ip
nameserver      some_ip

The master endpoint works but the replica doesn't connect yet, which is strange.

vitabaks commented 1 year ago

but the replica doesn't connect yet, which is strange.

What error do you get?

dig @localhost +short master.postgres-cluster.service.consul SRV

dig @localhost +short replica.postgres-cluster.service.consul SRV

oxycash commented 1 year ago

I was playing around today and it looks like something I did knocked the tags off all the nodes. I managed to get the replica tag added to one of the nodes using consul services register /Path/to/consul/conf.d/, but the master and the other replica node still don't have any tag; the same command didn't work on them.

So one replica is working now and the master is not.

bash-4.2$ dig @localhost +short replica.postgres-cluster.service.consul SRV
1 1 5432 0a328b7d.addr.dc1.consul.
bash-4.2$ dig @localhost +short master.postgres-cluster.service.consul SRV

oxycash commented 1 year ago

bash-4.2$ sudo consul services register /etc/consul.d/conf.d
Error registering service "postgres-cluster": Unexpected response code: 400 (Invalid check: TTL must be > 0 for TTL checks)

bash-4.2$ consul catalog services
consul
postgres-cluster

bash-4.2$ consul reload
Configuration reload triggered

bash-4.2$ sudo consul services register /etc/consul.d/conf.d
Error registering service "postgres-cluster": Unexpected response code: 400 (Invalid check: TTL must be > 0 for TTL checks)
bash-4.2$ curl http://127.0.0.1:8500/v1/catalog/node-services/pgnode03 | jq .Services[].Tags[]
bash-4.2$ curl -q http://127.0.0.1:8500/v1/catalog/node-services/pgnode03 | jq .Services[].Tags[]  
bash-4.2$ curl -s http://127.0.0.1:8500/v1/catalog/node-services/pgnode02 | jq .Services[].Tags[]
"replica"

oxycash commented 1 year ago

Replica2 got the tag after I manually changed the service JSON, which I didn't have to do for Replica1.

{
    "Name": "postgres-cluster",
    "Id": "postgres-cluster-replica",
    "Port": 6432,
    "Checks": [{"http": "http://Replica2/replica", "Interval": "2s"}, {"Args": ["systemctl", "status", "pgbouncer"], "Interval": "5s"}],
    "Tags": ["replica"]
  }

The master node still doesn't have any tag. I tried the above solutions but they don't work.

vitabaks commented 1 year ago

Tags are set by the playbook automatically. Just run the consul.yml playbook again and it will fix the configuration. Never make changes manually.
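
For example (the --limit pattern is only needed if you want to re-apply the configuration to specific hosts; the host name is one from this thread):

ansible-playbook consul.yml
# or, to re-apply only on a particular node
ansible-playbook consul.yml --limit pgnode03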

oxycash commented 1 year ago

If I understand correctly, the playbook doesn't use consul.register_service and registers the services itself, with an added pgbouncer health check, correct? Does service registration happen after the Patroni cluster is up, or before it is set up?

vitabaks commented 1 year ago

Yes, we register the consul service ourselves to have more flexibility in configuring the service.

The service is registered when the Consul agent starts, and the DNS records appear once the Patroni servers with the corresponding role pass their health checks.
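
For illustration, the master-side registration looks roughly like the replica definition quoted above, with the role-specific check and tag. This is only a sketch: the real file is rendered by the playbook, and the check address/port depend on your Patroni REST API settings (master_node_address is a placeholder).

{
    "Name": "postgres-cluster",
    "Id": "postgres-cluster-master",
    "Port": 6432,
    "Checks": [{"http": "http://master_node_address:8008/master", "Interval": "2s"}, {"Args": ["systemctl", "status", "pgbouncer"], "Interval": "5s"}],
    "Tags": ["master"]
}

Once the /master check passes on a node, the master DNS record points to it.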

oxycash commented 1 year ago

What happens when pgbouncer fails on the master node? I see a pgbouncer health check in Consul.

Edit: I have already tested it; the master endpoint goes down and no failover happens. I understand that failover for such a reason isn't a good enough trigger, but this becomes a point of failure.

vitabaks commented 1 year ago

What happens when pgbouncer fails on the master node? I see a pgbouncer health check in Consul.

Is there any point in keeping the DNS record if pgbouncer is not working?

You can change the conditions of the service checks; the playbook gives you that flexibility.
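
For example, if you prefer that a pgbouncer failure alone does not remove the node from DNS, the pgbouncer check can simply be left out of the Checks array (a sketch based on the definition shown earlier; the address is a placeholder):

"Checks": [{"http": "http://node_address:8008/master", "Interval": "2s"}]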