[Open] FarhanSajid1 opened this issue 4 years ago
@FarhanSajid1 Sorry but I can't understand where the issue is. It should have the entry once it has correctly entered the stolon cluster, the explanation is here: https://github.com/nsone/stolon/commit/87766c982c3fa8fc2ac899165dce690c18f5655f
A flag is not really needed; if the primary isn't reentering the cluster we should understand what is happening. Are the other previous synchronous standbys restarting correctly? If not, can you provide a reproducer and the sentinel/keeper logs?
@sgotti So the issue is that the custom pgHBA entries aren't rendered in this particular scenario. The downed primary node ends up stuck in a continuous no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off loop and never rejoins the cluster properly. The flag gives us the option to still render the pg_hba.conf file based on the custom fields. The other standbys restart, but then try to connect to the primary, which is stuck in the restart loop described above, so they do not come up healthy either unless you restart the primary and trigger a failover.
If this is the expected behavior, what should we do in order to prevent it? We still need to generate the custom entries in the pgHBA section.
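For reference, the custom ns1 entries visible in the pg_hba.conf dumps below are the kind we supply through the cluster spec's pgHBA field; a rough example of how they would be set with stolonctl (the cluster name and store flags here are placeholders, not our actual setup):

```sh
stolonctl --cluster-name mycluster --store-backend etcdv3 update --patch \
  '{ "pgHBA": [ "host ns1 ns1 127.0.0.1/32 trust", "host ns1 ns1 ::1/128 trust" ] }'
```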
replica logs
2020-07-13 01:44:43.213 UTC [25440] FATAL: could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
2020-07-13 01:44:48.215 UTC [25828] FATAL: could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
primary logs
2020-07-13T01:42:45.595Z INFO cmd/keeper.go:1462 our db requested role is master
2020-07-13T01:42:45.603Z INFO cmd/keeper.go:1498 already master
2020-07-13T01:42:45.643Z INFO cmd/keeper.go:1631 postgres parameters not changed
2020-07-13T01:42:45.664Z INFO cmd/keeper.go:1644 not allowing connection as normal users since synchronous replication is enabled, instance was down and not all sync standbys are synced
2020-07-13T01:42:45.665Z INFO cmd/keeper.go:1658 postgres hba entries not changed
2020-07-13 01:42:46.779 UTC [16856] FATAL: no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off
replica pg_hba.conf
root@d1778812a00f:/# cat ns1/data/var/lib/postgresql/data/cluster/postgres/pg_hba.conf
local postgres postgres md5
local replication replicator md5
host all postgres 0.0.0.0/0 md5
host all postgres ::0/0 md5
host replication replicator 0.0.0.0/0 md5
host replication replicator ::0/0 md5
host ns1 ns1 127.0.0.1/32 trust
host ns1 ns1 ::1/128 trust
local all all trust
host all all 127.0.0.1/32 md5
host all all ::1/128 md5
primary pg_hba.conf
local postgres postgres md5
local replication replicator md5
host all postgres 0.0.0.0/0 md5
host all postgres ::0/0 md5
host replication replicator 0.0.0.0/0 md5
host replication replicator ::0/0 md5
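Side by side, the primary's pg_hba.conf above is missing the custom ns1 entries the replica still has, which matches the rejected connection in the primary logs. A quick way to see the failure on the primary (the host/user/database values are taken from the error message; the exact psql output prefix may differ by version):

```sh
psql "host=127.0.0.1 user=ns1 dbname=ns1 sslmode=disable"
# expected to fail with: FATAL: no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off
```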
What would you like to be added: add a keeper flag skip-hba-render to control whether we generate the pgHBA config in the situation where the db is stopped, the role is master, and synchronous replication is enabled.

Why is this needed: At the moment Stolon does not render the pgHBA section of the postgresql.json file when the previous cluster's master is shut down while synchronous_replication is enabled. This can lead to a situation where certain entries for users that are needed in an environment do not have access to the database. We end up seeing no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off.

Concrete Example: If we have a 3-node cluster set up with synchronous_replication enabled and our pgHBA configuration consists of the following entries. After we stop the cluster, forced or not, the previous master node will not have the custom pg_hba configuration generated upon restarting; this results in the no pg_hba.conf entry for host message shown above. This is a situation where we need the custom pg_hba entries to be generated. The entries we see afterwards do not contain the custom entries, because we pass true to generateHBA in this scenario.
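To make the proposal concrete, here is a minimal sketch (not the actual stolon keeper code; only the flag name and the condition from the keeper log above come from this issue, the rest is assumed) of how a skip-hba-render flag could gate the fallback to internal-only pg_hba entries:

```go
// Sketch only: illustrates the proposed skip-hba-render behaviour, not stolon's real keeper code.
package main

import "fmt"

// onlyInternalHBA mirrors the decision described in the keeper log line
// "not allowing connection as normal users since synchronous replication is
// enabled, instance was down and not all sync standbys are synced".
// When that condition holds, generateHBA is effectively asked for internal-only
// entries and the custom pgHBA entries are dropped. The proposed skip-hba-render
// flag would opt out of that restriction so the custom entries keep being rendered.
func onlyInternalHBA(syncReplEnabled, instanceWasDown, allSyncStandbysSynced, skipHBARender bool) bool {
	if syncReplEnabled && instanceWasDown && !allSyncStandbysSynced {
		return !skipHBARender
	}
	return false
}

func main() {
	// Current behaviour: internal-only entries, custom ones are dropped.
	fmt.Println(onlyInternalHBA(true, true, false, false)) // true
	// With the proposed flag: custom entries are still rendered.
	fmt.Println(onlyInternalHBA(true, true, false, true)) // false
}
```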