reactive-tech / kubegres

Kubegres is a Kubernetes operator that allows you to deploy one or more clusters of PostgreSQL instances and manages database replication, failover and backups.
https://www.kubegres.io
Apache License 2.0

Fail-over did not work for me #169

Closed roncemer closed 11 months ago

roncemer commented 11 months ago

Running in k3s on a 3-node cluster. Start a Kubegres cluster with 3 nodes using the Getting Started manifest. Delete the sr-postgres-1-0 pod. The sr-postgres service switches to the sr-postgres-2-0 pod, which immediately fails with the error below.
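
For reference, a minimal sketch of that reproduction, assuming a manifest modelled on the Kubegres getting-started example (the secret name, image tag and storage size below are illustrative placeholders, not taken from this thread):

# Sketch only: a 3-replica Kubegres cluster named sr-postgres.
kubectl apply -f - <<EOF
apiVersion: kubegres.reactive-tech.io/v1
kind: Kubegres
metadata:
  name: sr-postgres
  namespace: default
spec:
  replicas: 3
  image: postgres:14.7
  database:
    size: 1Gi
  env:
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          name: sr-postgres-secret
          key: superUserPassword
    - name: POSTGRES_REPLICATION_PASSWORD
      valueFrom:
        secretKeyRef:
          name: sr-postgres-secret
          key: replicationUserPassword
EOF

# Once all three pods are ready, delete the primary pod to trigger the failover described above.
kubectl delete pod sr-postgres-1-0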

PostgreSQL Database directory appears to contain a database; Skipping initialization
2023-12-06 19:02:01.807 UTC [1] LOG: starting PostgreSQL 14.7 (Debian 14.7-1.pgdg110+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
2023-12-06 19:02:01.807 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-12-06 19:02:01.807 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-12-06 19:02:01.811 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-06 19:02:01.818 UTC [1] LOG: received fast shutdown request
2023-12-06 19:02:01.818 UTC [22] LOG: database system was shut down in recovery at 2023-12-06 19:00:27 UTC
2023-12-06 19:02:01.818 UTC [22] LOG: entering standby mode
2023-12-06 19:02:01.822 UTC [22] LOG: consistent recovery state reached at 0/1DB04B98
2023-12-06 19:02:01.822 UTC [22] LOG: invalid record length at 0/1DB04B98: wanted 24, got 0
2023-12-06 19:02:01.824 UTC [23] LOG: shutting down
2023-12-06 19:02:01.891 UTC [1] LOG: database system is shut down

Now, the sr-postgres service is failed, with the following message:

Last FailOver attempt has timed-out after 300 seconds. The new Primary DB is still NOT ready. It must be fixed manually. Until the PrimaryDB is ready, most of the features of Kubegres are disabled for safety reason. 'Primary DB StatefulSet to fix': sr-postgres-2 - FailOver timed-out
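
When Kubegres reports this time-out it expects the cluster operator to intervene manually. A sketch of the first diagnostic steps, using the resource names from this thread (assuming the CRD's resource name for kubectl is kubegres, as in the Kubegres documentation):

# Inspect the Kubegres resource for its failover status and events.
kubectl describe kubegres sr-postgres
# Look at the StatefulSet named in the error message and the logs of its pod.
kubectl get statefulset sr-postgres-2
kubectl logs sr-postgres-2-0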

A production-ready PostgreSQL HA solution would automatically detect this situation and heal itself. This is NOT production-ready.

I'm trying to use this for deploying a 3-node appliance which will be located in multiple data centers around the world. It absolutely MUST be completely bullet-proof. This is NOT bullet-proof; it's not even minimally capable of recovering from any type of failure. I don't understand how this is remotely considered useful.

roncemer commented 11 months ago

Ansible template for my manifest attached. postgresql-cluster-kubegres.yaml.j2.txt

dracon80 commented 11 months ago

I can confirm that I get the same error. I'm also running K3s, with Longhorn for storage. Deploying the cluster worked without issue and I was able to connect to and work with the cluster; pgAdmin showed it as healthy. However, once the primary pod was deleted, failover started and resulted in the same error. The pod was also never recreated, so I suspect that manually deleting the pod is what causes the issue.

alex-arica commented 11 months ago

We test new changes to Kubegres by running the automated tests located in this folder: https://github.com/reactive-tech/kubegres/tree/main/internal/test

There are more than 93 tests. The use case that you are referring to is tested by: https://github.com/reactive-tech/kubegres/blob/main/internal/test/primary_failure_and_recovery_test.go
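
If you want to run that suite yourself, a rough sketch of the invocation (an assumption on my part: the exact flags may differ, and the tests expect a kubeconfig pointing at a disposable test cluster, since they create and delete real resources):

# From the root of the kubegres repository:
go test ./internal/test/... -v -timeout 60m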

I tried to reproduce the issue that you reported manually, and I could not reproduce it. Please see below the logs for the mypostgres-2-0 pod:

Defaulted container "mypostgres-2" out of: mypostgres-2, setup-replica-data-directory (init)

PostgreSQL Database directory appears to contain a database; Skipping initialization

2023-12-10 09:25:00.864 GMT [1] LOG:  starting PostgreSQL 16.0 (Debian 16.0-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2023-12-10 09:25:00.864 GMT [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2023-12-10 09:25:00.864 GMT [1] LOG:  listening on IPv6 address "::", port 5432
2023-12-10 09:25:00.873 GMT [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-12-10 09:25:00.884 GMT [41] LOG:  database system was shut down in recovery at 2023-12-10 09:24:59 GMT
2023-12-10 09:25:00.884 GMT [41] LOG:  entering standby mode
2023-12-10 09:25:00.890 GMT [41] LOG:  consistent recovery state reached at 0/50223B0
2023-12-10 09:25:00.890 GMT [41] LOG:  invalid record length at 0/50223B0: expected at least 24, got 0
2023-12-10 09:25:00.890 GMT [1] LOG:  database system is ready to accept read-only connections
2023-12-10 09:25:00.896 GMT [45] FATAL:  could not connect to the primary server: could not translate host name "mypostgres" to address: Name or service not known
2023-12-10 09:25:00.898 GMT [46] FATAL:  could not connect to the primary server: could not translate host name "mypostgres" to address: Name or service not known
2023-12-10 09:25:00.898 GMT [41] LOG:  waiting for WAL to become available at 0/5002000
2023-12-10 09:25:00.935 GMT [41] LOG:  received promote request
2023-12-10 09:25:00.935 GMT [41] LOG:  redo is not required
2023-12-10 09:25:00.941 GMT [41] LOG:  selected new timeline ID: 2
2023-12-10 09:25:00.980 GMT [41] LOG:  archive recovery complete
2023-12-10 09:25:00.991 GMT [39] LOG:  checkpoint starting: force
2023-12-10 09:25:00.993 GMT [1] LOG:  database system is ready to accept connections
2023-12-10 09:25:01.030 GMT [39] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.010 s, sync=0.003 s, total=0.039 s; sync files=2, longest=0.002 s, average=0.002 s; distance=0 kB, estimate=0 kB; lsn=0/5022418, redo lsn=0/50223E0
2023-12-10 09:25:47.014 GMT [39] LOG:  checkpoint starting: force wait
2023-12-10 09:25:47.102 GMT [39] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 2 recycled; write=0.005 s, sync=0.001 s, total=0.089 s; sync files=0, longest=0.000 s, average=0.000 s; distance=16247 kB, estimate=16247 kB; lsn=0/6000060, redo lsn=0/6000028

And I can see that a new replica pod (mypostgres-4-0) was created after pod 2 was made primary:

NAME             READY   STATUS    RESTARTS   AGE
mypostgres-2-0   1/1     Running   0          8m26s
mypostgres-3-0   1/1     Running   0          10m
mypostgres-4-0   1/1     Running   0          7m44s
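
One way to double-check that pod 2 really is acting as the primary after such a failover (a sketch, assuming the default postgres superuser is usable inside the pod):

# pg_is_in_recovery() returns "f" on a primary and "t" on a standby.
kubectl exec mypostgres-2-0 -- psql -U postgres -c "SELECT pg_is_in_recovery();"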

Please let me know how else I can help, but I am not able to reproduce it. I have also renamed the ticket to something more meaningful. As we say in the UK: Keep Calm and Carry On :)

alex-arica commented 11 months ago

Closing since I could not reproduce this issue.

letizia66 commented 8 months ago

Same problem here; it was solved by changing the SELinux mode to permissive.
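
For anyone else hitting this, switching SELinux to permissive mode typically looks like the following (a sketch; the commands and config path assume a standard SELinux-enabled distribution such as RHEL or Fedora):

# Switch to permissive mode for the current boot.
sudo setenforce 0
# Persist the change across reboots.
sudo sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/selinux/config
# Verify the current mode.
getenforce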