sorintlab / stolon

PostgreSQL cloud native High Availability and more.

add keeper flag to generate pgHBA config on restarts #792

Open FarhanSajid1 opened 4 years ago

FarhanSajid1 commented 4 years ago

What would you like to be added: a keeper flag, skip-hba-render, to control whether we generate the pgHBA config in the situation where the db is stopped, the role is master, and synchronous replication is enabled.
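As a sketch, the keeper invocation might look like the following. Note that the skip-hba-render flag does not exist today; its name and behavior are only what this issue proposes, and the other flags are just a typical keeper setup with placeholder values.

stolon-keeper --cluster-name=mycluster --store-backend=etcdv3 \
  --data-dir=/var/lib/stolon/keeper --skip-hba-render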

Why is this needed: Stolon currently does not render the pgHBA section of the postgresql.json file when the previous cluster master was shut down and synchronous_replication is enabled. This can leave users that need database access in an environment without pg_hba entries, so their connections are rejected. We end up seeing: no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off

Concrete example: we have a 3-node cluster with synchronous_replication enabled, and our pgHBA configuration consists of the following entries:

  "pgHBA": [
    "host ns1 ns1 127.0.0.1/32 trust",
    "host ns1 ns1 ::1/128 trust",
  "local all all  trust",
  "host all all 127.0.0.1/32 md5",
  "host all all ::1/128 md5"
  ],
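(For reference, we apply this spec with stolonctl update --patch; the cluster name and store backend below are placeholders for our setup.)

stolonctl --cluster-name=mycluster --store-backend=etcdv3 update --patch \
  '{ "pgHBA": [ "host ns1 ns1 127.0.0.1/32 trust", "host ns1 ns1 ::1/128 trust", "local all all trust", "host all all 127.0.0.1/32 md5", "host all all ::1/128 md5" ] }'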

After we stop the cluster, forced or not, and restart it, the previous master node does not get the custom pg_hba configuration generated, which results in the no pg_hba.conf entry for host message shown above. This is a situation where we need the custom pg_hba entries to be generated.

The entries we see afterwards do not contain the custom entries, because we pass true to generateHBA in this scenario:

local postgres postgres md5
local replication replicator md5
host all postgres 0.0.0.0/0 md5
host all postgres ::0/0 md5
host replication replicator 0.0.0.0/0 md5
host replication replicator ::0/0 md5
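To make the behavior concrete, here is a minimal Go sketch of the decision as we understand it; the function and parameter names are illustrative, not stolon's actual API:

package main

import (
	"fmt"
	"strings"
)

// renderHBA sketches the keeper's choice between internal-only entries and
// internal plus custom entries. Names here are illustrative, not stolon's code.
func renderHBA(syncRepl, wasDown, allSyncStandbysSynced bool, custom []string) []string {
	// Internal entries, always rendered (taken from the output shown above).
	internal := []string{
		"local postgres postgres md5",
		"local replication replicator md5",
		"host all postgres 0.0.0.0/0 md5",
		"host all postgres ::0/0 md5",
		"host replication replicator 0.0.0.0/0 md5",
		"host replication replicator ::0/0 md5",
	}
	if syncRepl && wasDown && !allSyncStandbysSynced {
		// The "pass true to generateHBA" case described above: the custom
		// pgHBA entries from the cluster spec are dropped, so normal users
		// cannot connect, producing the no pg_hba.conf FATAL errors quoted
		// in this issue.
		return internal
	}
	return append(internal, custom...)
}

func main() {
	custom := []string{
		"host ns1 ns1 127.0.0.1/32 trust",
		"host ns1 ns1 ::1/128 trust",
	}
	// The failing scenario: sync repl enabled, instance was down, standbys not synced.
	fmt.Println(strings.Join(renderHBA(true, true, false, custom), "\n"))
}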
sgotti commented 4 years ago

@FarhanSajid1 Sorry, but I can't understand where the issue is. It should have the entry once it has correctly entered the stolon cluster; the explanation is here: https://github.com/nsone/stolon/commit/87766c982c3fa8fc2ac899165dce690c18f5655f

A flag is not really needed; if the primary isn't re-entering the cluster, we should understand what is happening. Are the other previous synchronous standbys restarting correctly? If not, can you provide a reproducer and the sentinel/keeper logs?

FarhanSajid1 commented 4 years ago

> @FarhanSajid1 Sorry, but I can't understand where the issue is. It should have the entry once it has correctly entered the stolon cluster; the explanation is here: nsone@87766c9
>
> A flag is not really needed; if the primary isn't re-entering the cluster, we should understand what is happening. Are the other previous synchronous standbys restarting correctly? If not, can you provide a reproducer and the sentinel/keeper logs?

@sgotti So the issue is that the custom pgHBA entries aren't rendered in this particular scenario. The downed primary node gets stuck in a continuous no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off loop and never joins the cluster properly. The flag would give us the option to still render the pg_hba.conf file with the custom entries. The other standbys restart, but then try to connect to the primary, which is stuck in the restart loop described above, so they do not come up healthy either, unless you restart the primary and trigger a failover.

If this is the expected behavior, what should we do in order to prevent this? We still need to generate the custom entries in the pgHBA section.
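(The only workaround we have found is to force a failover away from the stuck primary, e.g. with stolonctl failkeeper; the cluster name, store backend, and keeper UID below are placeholders for our setup, and whether failkeeper is the right tool here is our assumption.)

stolonctl --cluster-name=mycluster --store-backend=etcdv3 failkeeper keeper0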

replica logs

2020-07-13 01:44:43.213 UTC [25440] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
2020-07-13 01:44:48.215 UTC [25828] FATAL:  could not connect to the primary server: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

primary logs

2020-07-13T01:42:45.595Z    INFO    cmd/keeper.go:1462  our db requested role is master
2020-07-13T01:42:45.603Z    INFO    cmd/keeper.go:1498  already master
2020-07-13T01:42:45.643Z    INFO    cmd/keeper.go:1631  postgres parameters not changed
2020-07-13T01:42:45.664Z    INFO    cmd/keeper.go:1644  not allowing connection as normal users since synchronous replication is enabled, instance was down and not all sync standbys are synced
2020-07-13T01:42:45.665Z    INFO    cmd/keeper.go:1658  postgres hba entries not changed
2020-07-13 01:42:46.779 UTC [16856] FATAL:  no pg_hba.conf entry for host "127.0.0.1", user "ns1", database "ns1", SSL off
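(To check whether the sentinel still sees the sync standbys as out of sync at this point, stolonctl status can be used; the cluster name and store backend below are placeholders for our setup.)

stolonctl --cluster-name=mycluster --store-backend=etcdv3 status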

replica pg_hba.conf

root@d1778812a00f:/# cat  ns1/data/var/lib/postgresql/data/cluster/postgres/pg_hba.conf
local postgres postgres md5
local replication replicator md5
host all postgres 0.0.0.0/0 md5
host all postgres ::0/0 md5
host replication replicator 0.0.0.0/0 md5
host replication replicator ::0/0 md5
host ns1 ns1 127.0.0.1/32 trust
host ns1 ns1 ::1/128 trust
local all all  trust
host all all 127.0.0.1/32 md5
host all all ::1/128 md5

primary pg_hba.conf

local postgres postgres md5
local replication replicator md5
host all postgres 0.0.0.0/0 md5
host all postgres ::0/0 md5
host replication replicator 0.0.0.0/0 md5
host replication replicator ::0/0 md5