Closed oxycash closed 1 year ago
We are getting the error below when trying to set up on OL7. We had to make some changes to the ansible playbook to get things going, so we're not sure what we missed.
To help you with this issue, I need to know the details of the error and what changes you have made.
Can you help us understand how consul setup, consul service registration and dnsmasq/iptables/netaddr combination is used?
By default, DNS is served from port 53. On most operating systems, this requires elevated privileges. Rather than running Consul with an administrative or root account, we forward appropriate queries to Consul (running on an unprivileged port).
On the cluster nodes, we install and configure dnsmasq to forward DNS queries for the consul domain to the local Consul agent (if consul_dnsmasq_enable: true). Details: https://developer.hashicorp.com/consul/tutorials/networking/dns-forwarding
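For reference, a minimal sketch of the dnsmasq drop-in this produces (assuming Consul's default DNS port 8600; the file path matches the one shown later in this thread):

```
# /etc/dnsmasq.d/10-consul (sketch)
# Forward every query under the .consul domain to the local Consul
# agent's DNS interface; all other queries go to the regular servers.
server=/consul/127.0.0.1#8600
```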
You can use this playbook to install and configure a Consul client + dnsmasq on the application servers, or you can take care of it yourself.
In order to configure the application servers, simply add them to the inventory file and run the playbook.
Inventory (example):
[consul_instances] # recommendation: 3 or 5-7 nodes
10.128.64.140 consul_node_role=server consul_bootstrap_expect=true
10.128.64.142 consul_node_role=server consul_bootstrap_expect=true
10.128.64.143 consul_node_role=server consul_bootstrap_expect=true
10.128.64.144 consul_node_role=client
10.128.64.145 consul_node_role=client
Note: In this example, .144 and .145 are the application servers on which we install Consul in client mode (consul_node_role=client).
Run playbook:
ansible-playbook consul.yml
More details about how it works:
We don't have systemd-resolved, so I made the changes below.
[dev@pgnode01 ~]$ sudo cat /etc/hosts
127.0.0.1 localhost pgnode01
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
10.50.39.33 host_name
[dev@pgnode01 ~]$ host consul.service.consul
Host consul.service.consul not found: 3(NXDOMAIN)
[dev@pgnode01 ~]$ sudo ls /etc/systemd/
bootchart.conf coredump.conf journald.conf logind.conf pstore.conf rhel-dmesg system system.conf user user.conf
[dev@pgnode01 ~]$ sudo vi /etc/resolv.conf
[dev@pgnode01 ~]$ sudo cat /etc/resolv.conf
search subdomain.domain.com
nameserver some_ip
nameserver some_ip
nameserver 127.0.0.1
nameserver 127.0.0.1
[dev@pgnode01 ~]$ host consul.service.consul
Host consul.service.consul not found: 3(NXDOMAIN)
[dev@pgnode01 ~]$ sudo vi /etc/dnsmasq.conf
[dev@pgnode01 ~]$ sudo cat /etc/dnsmasq.d/10-consul
server=/consul/127.0.0.1#8600
server=8.8.8.8
server=9.9.9.9
Coming to the ansible playbook, we completely skipped the Consul part. We got everything up and running, except that endpoint connectivity doesn't work at all: nslookup consul.service.consul goes to the internal DNS servers instead of 127.0.0.1. I don't have any recursors set in Consul.
Just realized this
nslookup master.postgres-cluster.service.consul 127.0.0.1 -port=8600
Server: 127.0.0.1
Address: 127.0.0.1#8600
Name: master.postgres-cluster.service.consul
Address: current_master_ip
nslookup replica.postgrs-cluster.service.consul 127.0.0.1 -port=8600
Server: 127.0.0.1
Address: 127.0.0.1#8600
** server can't find replica.postgrs-cluster.service.consul.OURDOMAIN.com: REFUSED
systemctl --version
systemd 219
+PAM +AUDIT +SELINUX +IMA -APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 -SECCOMP +BLKID +ELFUTILS +KMOD +IDN
dig @127.0.0.1 -p 8600 postgres-cluster.service.consul
; <<>> DiG 9.11.4-P2-RedHat-9.11.4-26.P2.el7_9.8 <<>> @127.0.0.1 -p 8600 postgres-cluster.service.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 31501
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;postgres-cluster.service.consul. IN A
;; ANSWER SECTION:
postgres-cluster.service.consul. 0 IN A master_node_ip
;; Query time: 0 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Tue Feb 14 17:46:46 GMT 2023
;; MSG SIZE rcvd: 76
I recommend leaving only nameserver 127.0.0.1 in resolv.conf, which will effectively mean using dnsmasq. Next, you can add additional servers in the dnsmasq configuration by listing them in the consul_dnsmasq_servers variable.
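For example, a sketch of the relevant variables (the upstream resolver IPs here are the public DNS examples from the dnsmasq config shown earlier; replace them with your internal DNS servers):

```yaml
# group_vars sketch — variable names as used by the consul role
consul_dnsmasq_enable: true
consul_dnsmasq_servers:
  - "8.8.8.8"
  - "9.9.9.9"
```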
Coming to the ansible playbook, we completely skipped the Consul part.
I do not recommend making any manual changes; use the ansible playbook.
There are quite a lot of things to consider if you want to create a really robust cluster. If you deploy your cluster manually, especially without prior experience, the risk of a configuration mistake is high, and the cluster may not work as reliably as you expect.
Thank you @vitabaks
Instead of removing them, I just re-ordered the nameserver list and things started working.
[dev@pgnode01 ~]$ sudo cat /etc/resolv.conf
search subdomain.domain.com
nameserver 127.0.0.1
nameserver some_ip
nameserver some_ip
The master endpoint works, but the replica doesn't connect yet, which is strange.
but replica doesn't connect yet, which is strange.
What error?
dig @localhost +short master.postgres-cluster.service.consul SRV
dig @localhost +short replica.postgres-cluster.service.consul SRV
I was playing around today, and it looks like something I did knocked the tags off all the nodes. I managed to get the replica tag added to one of the nodes using consul services register /Path/to/consul/conf.d/, but the master and the other replica node still don't have any tag; the same command didn't work on them.
So one replica is working now and master is not.
bash-4.2$ dig @localhost +short replica.postgres-cluster.service.consul SRV
1 1 5432 0a328b7d.addr.dc1.consul.
bash-4.2$ dig @localhost +short master.postgres-cluster.service.consul SRV
bash-4.2$ sudo consul services register /etc/consul.d/conf.d
Error registering service "postgres-cluster": Unexpected response code: 400 (Invalid check: TTL must be > 0 for TTL checks)
bash-4.2$ consul catalog services
consul
postgres-cluster
bash-4.2$ consul reload
Configuration reload triggered
bash-4.2$ sudo consul services register /etc/consul.d/conf.d
Error registering service "postgres-cluster": Unexpected response code: 400 (Invalid check: TTL must be > 0 for TTL checks)
bash-4.2$ curl http://127.0.0.1:8500/v1/catalog/node-services/pgnode03 | jq .Services[].Tags[]
bash-4.2$ curl -q http://127.0.0.1:8500/v1/catalog/node-services/pgnode03 | jq .Services[].Tags[]
bash-4.2$ curl -s http://127.0.0.1:8500/v1/catalog/node-services/pgnode02 | jq .Services[].Tags[]
"replica"
Replica2 got the tag after I manually changed the service JSON, which I didn't have to do for Replica1.
{
"Name": "postgres-cluster",
"Id": "postgres-cluster-replica",
"Port": 6432,
"Checks": [{"http": "http://Replica2/replica", "Interval": "2s"}, {"Args": ["systemctl", "status", "pgbouncer"], "Interval": "5s"}],
"Tags": ["replica"]
}
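The 400 error seen earlier ("TTL must be > 0 for TTL checks") usually means one of the check definitions has neither a probe with an interval nor a positive TTL value, so Consul treats it as a TTL check with a TTL of zero. As a sketch only, a checks block that Consul would accept might look like this (the Patroni-style URL, port, and the 30s value are assumptions, not taken from the playbook):

```json
{
  "Checks": [
    {"http": "http://127.0.0.1:8008/replica", "Interval": "2s"},
    {"TTL": "30s"}
  ]
}
```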
The master node still doesn't have any tag; I tried the above solutions, but they don't work.
Tags are set by the playbook automatically. Just run the consul.yml playbook again and it will fix the configuration. Never make changes manually.
If I understand correctly, the playbook doesn't use consul.register_service and instead registers the services itself with an added pgbouncer health check, correct? Does service registration happen after the Patroni cluster is up, or before it is set up?
Yes, we register the consul service ourselves to have more flexibility in configuring the service.
The service is registered when the Consul service starts, and DNS records appear after the Patroni servers with the corresponding role pass their checks.
What happens when pgbouncer fails on master node? I see a PGBouncer health check in consul.
Edit: I have already tested it; the master endpoint goes down and no failover happens. I understand that failing over for such a reason isn't ideal, but this becomes a point of failure.
What happens when pgbouncer fails on master node? I see a PGBouncer health check in consul.
Does it make sense to keep the DNS records if pgbouncer does not work?
You can change the terms of the service checks; the playbook gives you that flexibility.
bash-4.2$ consul catalog services
consul
postgres-cluster
bash-4.2$ psql -U postgres -h master.postgres-cluster.service.consul -p 5432
psql: error: could not translate host name "master.postgres-cluster.service.consul" to address: Name or service not known
sudo netstat -ap | grep 8600 returns nothing